An ash I know there stands, 
Yggdrasill is its name, 

a tall tree, showered 

with shining loam. 

From there come the dews 
that drop in the valleys. 

It stands forever green over 
Urðr’s well. 

Preface 


I have often received comments that my published papers are spread all over the place, and that 
it is necessary to read a large subset of my papers before they individually begin to make sense. 
There is also the problem that many of my key papers were published in conference proceedings, 
or not published at all, which makes it difficult to obtain copies of them. These are the main 
reasons for the creation of this set of 4 volumes, which collect all of my papers so that they can 
be conveniently accessed alongside each other. There is some duplication of material, where for 
instance a technical report is very similar to a subsequently published paper, but I have included 
everything for completeness. The material is split between the 4 volumes as follows: 


1. Published Papers — journal papers, conference papers, and book chapters. 


2. Reports : Part 1 — Royal Signals and Radar Establishment (RSRE), Defence Research Agency 
(DRA), and Defence Evaluation and Research Agency (DERA) technical reports and research 
notes. 


3. Reports : Part 2 — QinetiQ technical reports. 


4. Unpublished Papers — mostly arXiv papers. 


I worked as a British scientific civil servant for about 20 years, and during that time I published a lot 
of scientific papers all of which are British Crown copyright. Fortunately, Her Majesty’s Stationery 
Office (HMSO) has very liberal views on allowing the reproduction of Crown copyright material. 
This has made possible the creation of this collection of papers which contains retypeset versions 
of all of my publications. 


The route by which I produced this set of volumes was long and tortuous, starting sometime in 
the 1990s and continuing off-and-on (more off than on) up to the publication of this set of volumes. 
I experimented with many different software applications for processing the retypeset publications, 
but I ultimately converged on using LaTeX (via the LyX front-end to LaTeX) as being the only 
document processor that could easily create the high-quality results that I needed. I should have 
started doing things this way on “day one”, but I wrongly believed that there must be a better 
approach, so I went off to explore the universe of alternative possibilities before finally returning to 
LaTeX. 

Introduction 


The motivation for the research that is described in these volumes is the wish to explain things in 
terms of their underlying causes, rather than merely being satisfied with phenomenological descrip- 
tions. When this reductionist approach is applied to information processing it allows the internal 
structure of information to be analysed, so information processing algorithms can then be derived 
from first principles. 


One of the simplest examples of this approach is the diagonalisation of a data covariance matrix — 
there are many variants of this basic approach, such as singular value decomposition — in which the 
assumed independent components of high-dimensional data are identified and extracted. The main 
limitation of this type of information analysis approach is that it is based on linear algebra applied 
globally to the data space, so it is unable to preserve information about any local data structure in 
the data space. For instance, if the data lives on a low-dimensional curved manifold embedded in 
the data space, then only the global properties of this manifold would be preserved by global linear 
algebra methods. 
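
As an illustrative sketch of the covariance diagonalisation described above (it is not code taken from these volumes), the following Python fragment diagonalises the 2 x 2 covariance matrix of some synthetic data and rotates the data into the frame in which its components are uncorrelated. The synthetic data, the 30 degree mixing angle, and the helper names `covariance_2d`, `diagonalise_2d` and `project` are all assumptions made purely for illustration.

```python
import math
import random

def covariance_2d(points):
    """Return (cxx, cxy, cyy) for a list of (x, y) samples."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return cxx, cxy, cyy

def diagonalise_2d(cxx, cxy, cyy):
    """Eigen-decompose the symmetric matrix [[cxx, cxy], [cxy, cyy]].

    Returns the angle of the principal axis and the two eigenvalues,
    largest first."""
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    mean = 0.5 * (cxx + cyy)
    radius = math.hypot(0.5 * (cxx - cyy), cxy)
    return theta, mean + radius, mean - radius

def project(points, theta):
    """Rotate the samples into the principal-axis frame."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x + s * y, -s * x + c * y) for x, y in points]

random.seed(0)
# Hypothetical data: two independent sources mixed by a 30 degree rotation.
angle = math.radians(30.0)
sources = [(random.gauss(0.0, 3.0), random.gauss(0.0, 0.5)) for _ in range(5000)]
data = [(math.cos(angle) * a - math.sin(angle) * b,
         math.sin(angle) * a + math.cos(angle) * b) for a, b in sources]

theta, lam1, lam2 = diagonalise_2d(*covariance_2d(data))
rotated = project(data, theta)
_, cxy_rot, _ = covariance_2d(rotated)  # off-diagonal term after rotation
```

In this sketch the rotation is a single global linear map, so it recovers the mixing angle of the whole data cloud but would say nothing about any local curvature of the data, which is the limitation that the paragraph above points out.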


In practice, data whose high-dimensional structure is non-trivial typically lives on a noisy version 
of a curved manifold, so techniques for analysing such data must automatically handle this type of 
structure. For instance, a blurred image of a point source is described by its underlying degrees 
of freedom — i.e. the position of the source — and as the source moves about it generates a curved 
manifold that lives in the high-dimensional space of pixel values of the sampled image. The basic 
problem is then to deduce the internal properties of this manifold by analysing examples of such 
images. A more challenging problem would be to extend this analysis to images that contain several 
overlapping blurred images of point sources, and so on. There is no limit to the complexity of the 
types of high-dimensional data that one might want to analyse. 


These methods then need to be automated so that they do not rely on human intervention, which 
would then allow them to be inserted as “components” into information processing networks. The 
purpose of the research that is described in these volumes is to develop principled information 
processing methods that can be used for such analysis. Self-organising information processing 
networks arise naturally in this context, in which ways of cutting up the original manifold into 
simpler pieces emerge automatically. 

CLUSTER DECOMPOSITION OF PROBABILITY DENSITY FUNCTIONS * 


S. P. Luttrell 
Royal Signals and Radar Establishment, St Andrews Road, Malvern, WORCS, WR14 3PS, UK 


We derive a hierarchical cluster decomposition of joint probability density functions. This is 
realised in a multi-layer topographic neural network structure. Possible applications in image pro- 
cessing include clutter modelling and target detection. 


Summary 


A problem which is frequently encountered in the Bayesian analysis of images (or, more generally, patterns) is 
the choice of a suitable representation of probability density functions (PDF). We shall concentrate on network 
representations which are acquired as a result of a training process. Adaptive Markov random fields (MRF) provide 
one such approach, and the Boltzmann machine is a familiar example of an adaptive MRF. A limitation of the MRF 
approach to image analysis is the need to perform extensive numerical simulations in order to extract the simplest 
results. We therefore seek an alternative approach which substantially reduces the amount of computation which is 
involved without at the same time compromising the quality of the PDF representation too much. 

We shall thus present a hierarchical cluster decomposition scheme. This is equivalent to a multilayer feedforward 
unsupervised neural network which forms a suitable hierarchical feature space for decomposing the input image. The 
node functions and the learning algorithm which we use are related to the self-organising scheme which has been 
proposed by Kohonen [1]. The PDF representation is completed by measuring a set of co-occurrence matrices on 
the network and combining them to produce the maximum entropy estimate of the true PDF. This representation is 
closely related to that used in the WISARD pattern recognition device. 

This approach is much cheaper computationally than the MRF approach. During training we improve Kohonen’s 
algorithm by introducing a renormalisation scheme in which we progressively increase the number of “neurons” in order 
to optimise the use of computational resources. During testing an image propagates forwards through the network 
in one pass, and the appropriate co-occurrence matrix elements are picked up. This type of network could find use in 
Bayesian image processing applications which currently use an MRF. The speed of use of the cluster decomposition 
scheme (both training and testing) is such that real-time adaptive PDFs might also be possible. 

We shall present some simple demonstrations of this type of neural network, and where possible we shall compare 
the theoretical with the actual PDF. 

[1] Kohonen T, 1984, “Self-organisation and associative memories”, Springer, Berlin 


*Typeset in LaTeX on May 3, 2019. 
This summary was submitted to the IEEE Conference on Neural Information Processing Systems (1988) on 9 May 1988. Paper #295. It 
was not accepted for presentation, but it underpins several subsequently published papers. 


Adaptive Cluster Expansion (ACE): A Hierarchical Bayesian Network * 


S P Luttrell 
Room EX21, QinetiQ, Malvern Technology Centre 


Using the maximum entropy method, we derive the “adaptive cluster expansion” (ACE), which can 
be trained to estimate probability density functions in high dimensional spaces. The main advantage 
of ACE over other Bayesian networks is its ability to capture high order statistics after short training 
times, which it achieves by making use of a hierarchical vector quantisation of the input data. We 
derive a scheme for representing the state of an ACE network as a “probability image”, which allows 
us to identify statistically anomalous regions in an otherwise statistically homogeneous image, for 
instance. Finally, we present some probability images that we obtained after training ACE on some 
Brodatz texture images — these demonstrate the ability of ACE to detect subtle textural anomalies. 


I. INTRODUCTION 


The purpose of this paper is to train probabilistic net- 
work models of images of homogeneous textures for use 
in Bayesian decision making. In our past work in this 
area [11-14, 16] we successfully used entropic methods 
to design Markov random field (MRF) models to repro- 
duce the observed statistical properties of textured im- 
ages. We now wish to formulate a novel MRF structure 
that requires much less effort to train and use. There 
are two essential ingredients in our simplification: we do 
not use hidden variables, and we restrict our attention to 
hierarchical transformations of the data. 

The use of hidden variables is a flexible way of mod- 
elling high order correlations in data [1], but it leads 
to lengthy Monte Carlo simulations to estimate aver- 
ages over the hidden variables. An MRF without hid- 
den variables is specified by a set of transformation func- 
tions, each of which extracts some statistic from the data, 
and together they provide sufficient information to com- 
pute the probability density function (PDF) of the data 
[14, 16]. 

We can obtain a wealth of statistical information about 
the data by restricting our attention to a finite number 
of well-defined transformation functions. For instance, 
in [5] a number of useful textural features are presented, 
which may be used to model and discriminate between 
various textures that occur in images. However, we wish 
to design our transformation functions adaptively in a 
data-driven manner, so that the resulting set is opti- 
mised to capture the statistical properties of the data. 
We choose to use adaptive hierarchical transformation 
functions, because these not only capture statistical prop- 
erties at many length scales, but are also very easy to 
train. 


*Typeset in LaTeX on May 3, 2019. 
This is archived at http://arxiv.org/abs/cs/0410020, 10 Oct 2004. 
This paper was submitted to IEEE Trans. PAMI on 2 May 1991. 

We briefly discussed hierarchical transformation functions 
in [21], where we conjectured that topographic mappings 
[8] might be appropriate for connecting together 
the layers of the hierarchy. We investigated topographic 
mappings in [15, 17, 19, 22, 23] and found that they could 
be rapidly trained to produce useful multiscale represen- 
tations of data. We therefore use multilayer topographic 
mappings to adaptively design hierarchical transforma- 
tion functions of data for use in MRF models. In this 
type of model different layers of the hierarchy measure 
statistical structure on different length scales, and shorter 
length scale structures are clustered together and cor- 
related to produce longer length scale structures. We 
therefore frequently refer to this type of scheme as an 
adaptive cluster expansion (ACE). By interpreting ACE 
as a multilayered n-tuple processor we can relate ACE to 
a multilayered version of WISARD [2]. 

We demonstrate the ability of ACE to learn the statis- 
tical structure of texture by training an adaptive pyramid 
image processor. There are many ways of displaying the 
statistical information extracted from the data by such a 
processor, but we prefer to use what we call a “probabil- 
ity image”, which is generated from the estimated local 
PDF of the data. 

The layout of this paper is as follows. In Section II 
we use the maximum entropy method to estimate the 
PDF of the data, subject to a set of marginal probability 
constraints measured using hierarchical transformation 
functions, to yield an MRF model in closed form (i.e. no 
undetermined Lagrange multipliers). In Section III we 
extend this result to remove some of the limitations of its 
hierarchical structure, such as translation non-invariance, 
and describe the ACE system for producing probability 
images. In Section IV we present the result of applying 
ACE to some textured images taken from the Brodatz 
set [3]. 


II. MAXIMUM ENTROPY PDF ESTIMATION 


In this section we present a derivation of a hierarchical 
maximum entropy estimate $Q_{\mathrm{mem}}(x)$ of an observed 
true PDF $P(x)$, where we constrain $Q_{\mathrm{mem}}(x)$ so that 
certain marginal PDFs agree with observation. Although we 
consider only the case of a binary tree, we also present 
a simple diagrammatic representation of this result that 
allows us easily to extend it to general trees. 


A. Basic maximum entropy method 


For completeness, we first of all outline the basic principles [6, 7] of the maximum entropy method 
of assigning estimates of PDFs. Introduce the entropy functional $H$ 

\[ H \equiv -\int dx\, Q(x) \log\left(\frac{Q(x)}{Q_0(x)}\right) \tag{2.1} \]

in which the PDF $Q_0(x)$ is used to introduce prior knowledge about $P(x)$. Loosely speaking, $H$ 
measures the extent to which $Q(x)$ is non-committal about the value that $x$ might take. The 
maximum entropy method consists of maximising $H$ subject to the following set of constraints 

\[ C_{1,i} \equiv \int dx\, Q(x)\, y_i(x) - \int dx\, P(x)\, y_i(x) = 0 \tag{2.2} \]

where the $y_i(x)$ are the components of a vector $\mathbf{y}(x)$ of sampling functions. These 
constraints ensure that certain average values are the same whether they are measured using $Q(x)$ 
(i.e. our estimated PDF) or using $P(x)$ (i.e. the observed true PDF). By carefully selecting the 
$\mathbf{y}(x)$ we can optimise the agreement between $Q(x)$ and $P(x)$ as appropriate. 

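$Q_{\mathrm{mem}}(x)$ may be found by introducing a vector $\boldsymbol{\lambda}$ of Lagrange 
multipliers, and functionally differentiating $H - \sum_i \lambda_i C_{1,i}$ with respect to $Q(x)$ 
to yield eventually 

\[ Q_{\mathrm{mem}}(x) = \frac{Q_0(x) \exp(-\boldsymbol{\lambda} \cdot \mathbf{y}(x))}{\int dx'\, Q_0(x') \exp(-\boldsymbol{\lambda} \cdot \mathbf{y}(x'))} \tag{2.3} \]

The undetermined Lagrange vector $\boldsymbol{\lambda}$ must be chosen in such a way that the 
constraints are satisfied; this is usually a non-trivial problem. 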
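
To illustrate why the Lagrange multipliers generally have to be found numerically, the following Python sketch solves a discrete, single-constraint version of Equation 2.3 by bisection. It is an illustration under stated assumptions rather than the method used in this paper: the integral is replaced by a sum over six states, there is one sampling function $y(x) = x$, and the die example, the target mean of 4.5, and the function names are all hypothetical.

```python
import math

def maxent_distribution(states, q0, y, lam):
    """Discrete form of Equation 2.3 for a single sampling function y."""
    weights = [q * math.exp(-lam * y(x)) for x, q in zip(states, q0)]
    norm = sum(weights)
    return [w / norm for w in weights]

def mean_of_y(states, q, y):
    """Average of the sampling function y under the distribution q."""
    return sum(qi * y(x) for x, qi in zip(states, q))

def solve_multiplier(states, q0, y, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Find lambda by bisection so that the mean of y under Q hits target.

    The mean of y falls monotonically as lambda rises, so bisection works
    whenever the target lies strictly between min(y) and max(y)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        q = maxent_distribution(states, q0, y, mid)
        if mean_of_y(states, q, y) > target:
            lo = mid  # mean too large, so lambda must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical example: a six-sided die whose observed mean score is 4.5.
states = [1, 2, 3, 4, 5, 6]
q0 = [1.0 / 6.0] * 6            # uniform prior Q0

def score(x):
    """The single sampling function y(x) = x."""
    return float(x)

lam = solve_multiplier(states, q0, score, target=4.5)
q_mem = maxent_distribution(states, q0, score, lam)
```

Even in this one-dimensional case the multiplier is only available through an iterative search; with a vector of multipliers the search becomes a coupled multi-dimensional problem.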


Now we shall consider a special case of the maximum entropy problem in which we carefully design 
the $y_i(x)$ so that they constrain a set of marginal probabilities [16]. Thus we make the 
following replacements 

\[ y_i(x) \longrightarrow \delta(y - y(x)), \qquad \lambda_i \longrightarrow \lambda(y) \tag{2.4} \]

where $\delta(y - y(x))$ is a Dirac delta function. In the $\{y_i(x), \lambda_i\}$ version of the 
maximum entropy problem, by varying the value of an index $i$ we could scan through the set of 
constraint functions $y_i(x)$ and Lagrange multipliers $\lambda_i$. However, in the 
$\{\delta(y - y(x)), \lambda(y)\}$ version of the maximum entropy problem, by varying the value of 
a variable $y$ we can scan through the set of constraint functions $\delta(y - y(x))$ and Lagrange 
multipliers $\lambda(y)$. 

The modification in Equation 2.4 causes the constraints in Equation 2.2 to become 

\[ C_2(y) \equiv \int dx\, Q(x)\, \delta(y - y(x)) - \int dx\, P(x)\, \delta(y - y(x)) = 0 \tag{2.5} \]

Thus the delta function constraints force $Q(y) = P(y)$, where 

\[ Q(y) \equiv \int dx\, Q(x)\, \delta(y - y(x)), \qquad P(y) \equiv \int dx\, P(x)\, \delta(y - y(x)) \tag{2.6} \]

Note that we have used a rather loose notation for our PDFs: $P(x)$ and $P(y)$ are in fact 
different functions of their respective arguments. We have made this choice of notation for 
simplicity, because the context will always indicate unambiguously which PDF is required. 

By analogy with the previous maximum entropy derivation, $Q_{\mathrm{mem}}(x)$ may be found by 
functionally differentiating $H - \int dy\, \lambda(y)\, C_2(y)$ with respect to $Q(x)$ to yield 

\begin{align}
Q_{\mathrm{mem}}(x) &= \frac{Q_0(x) \exp(-\lambda(y(x)))}{\int dx'\, Q_0(x') \exp(-\lambda(y(x')))} \tag{2.7} \\
&\longrightarrow Q_0(x)\, f(y(x)) \tag{2.8}
\end{align}


where $\lambda(y(x))$ is an undetermined Lagrange function of $y(x)$. In Equation 2.8 we present a 
simpler notation by introducing an undetermined function $f(y(x))$ to absorb the exponential 
function and the denominator term that appeared in Equation 2.7. We may impose the constraints in 
Equation 2.5, and use the definitions of $Q(y)$ and $P(y)$ in Equation 2.6 to obtain $f(y)$ in the 
form 

\[ f(y) = \frac{P(y)}{\int dx'\, Q_0(x')\, \delta(y - y(x'))} \tag{2.9} \]

and $Q_{\mathrm{mem}}(x)$ in the form 

\[ Q_{\mathrm{mem}}(x) = \frac{Q_0(x)\, P(y(x))}{\int dx'\, Q_0(x')\, \delta(y(x) - y(x'))} \tag{2.10} \]

Note that this result is a closed form solution because it contains no undetermined Lagrange 
functions, unlike Equation 2.3 which contains an undetermined Lagrange vector 
$\boldsymbol{\lambda}$. The normalisation of this solution can be verified as follows 

\[ \begin{aligned}
\int dx\, Q_{\mathrm{mem}}(x) &= \int dx\, dy\, \delta(y - y(x))\, \frac{Q_0(x)\, P(y)}{\int dx'\, Q_0(x')\, \delta(y - y(x'))} \\
&= \int dy\, P(y) \\
&= 1
\end{aligned} \tag{2.11} \]

where we use the identity $\int dy\, \delta(y - y(x)) = 1$ to create a dummy integral over $y$. 
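
The closed form result in Equation 2.10 can be checked numerically on a small discrete example. The following Python sketch replaces the integrals by sums and the delta function by an equality test; the three-bit input, the choice of $y(x)$ as the number of set bits, the observed marginal `p_y`, and the function name `maxent_from_marginal` are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import product

def maxent_from_marginal(states, q0, y, p_y):
    """Discrete analogue of Equation 2.10.

    Q_mem(x) = Q0(x) P(y(x)) / sum over x' of Q0(x') [y(x') == y(x)]
    """
    z = defaultdict(float)
    for x in states:
        z[y(x)] += q0[x]          # denominator, grouped by the value of y
    return {x: q0[x] * p_y[y(x)] / z[y(x)] for x in states}

# Hypothetical example: x is a triple of bits and y(x) counts the ones.
states = list(product((0, 1), repeat=3))
q0 = {x: 1.0 / len(states) for x in states}   # uniform prior Q0
y = sum                                        # many-to-one transformation
p_y = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}         # assumed observed marginal P(y)

q_mem = maxent_from_marginal(states, q0, y, p_y)

# Marginal of the estimate, to compare with the observed P(y).
q_y = defaultdict(float)
for x, q in q_mem.items():
    q_y[y(x)] += q
```

In this sketch `q_mem` sums to one and its marginal `q_y` reproduces `p_y`, with no multiplier search required, which is the sense in which the solution is closed form.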


B. Hierarchical maximum entropy method 


The purpose of this subsection is to present a general- 
isation of Equation 2.10 that uses hierarchical transfor- 
mation functions. 

In practice the result in Equation 2.10 has a limited usefulness. Firstly, we would like to impose 
many simultaneous constraints, each using its own constraint function $\delta(y_i - y_i(x))$ in 
Equation 2.5, but this cannot in general be done without sacrificing our closed form solution in 
Equation 2.10. Secondly, we would like to impose higher order constraints, using a constraint 
function $\delta(\mathbf{y} - \mathbf{y}(x))$. This may easily be done by making the replacement 
$y \longrightarrow (y_1, y_2, \cdots)$ in Equation 2.10. However, there is a hidden problem, 
because the greater the dimensionality of $\mathbf{y}$, the harder it is to make the necessary 
measurements to establish the form of $P(\mathbf{y})$. Fortunately, there is a solution to both of 
these problems, which we shall describe below. 

We shall apply the maximum entropy method with constraints of the form shown in Equation 2.5 to a 
hierarchy of transformed versions of the input vector $x$. In order to make our calculation 
tractable we introduce the notation shown in Figure 1. The $x_{ijk\cdots}$ are various partitions 
of the input vector $x$, the $y_{ijk\cdots}$ are various transformed versions 
$y_{ijk\cdots}(x_{ijk\cdots})$ of the input $x_{ijk\cdots}$, and the 
$f_{ijk\cdots,i'j'k'\cdots}$ are the Lagrange functions 
$f_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots})$ that appear in the generalised 
version of the maximum entropy solution $Q_{\mathrm{mem}}(x)$ in Equation 2.7. 

We choose to write the dependence of $y_{ijk\cdots}$ directly on the input $x_{ijk\cdots}$, even 
though the value of $y_{ijk\cdots}$ is obtained via a number of intermediate transformations 
leading from the leaf nodes of the tree up to node $ijk\cdots$, because this leads to a 
transparent hierarchical maximum entropy derivation. It is convenient to define 
$\Pi_{ijk\cdots}(x_{ijk\cdots})$ as the product of the Lagrange functions that appear beneath node 
$ijk\cdots$ of the tree. $\Pi_{ijk\cdots}(x_{ijk\cdots})$ has the following recursion property 

\[ \Pi_{ijk\cdots}(x_{ijk\cdots}) = f_{ijk\cdots1,ijk\cdots2}(y_{ijk\cdots1}(x_{ijk\cdots1}), y_{ijk\cdots2}(x_{ijk\cdots2}))\, \Pi_{ijk\cdots1}(x_{ijk\cdots1})\, \Pi_{ijk\cdots2}(x_{ijk\cdots2}) \tag{2.12} \]


Also introduce a normalisation (or Jacobian) factor defined as 

\[ Z_{ijk\cdots}(y_{ijk\cdots}) \equiv \int dx_{ijk\cdots}\, \delta(y_{ijk\cdots} - y_{ijk\cdots}(x_{ijk\cdots}))\, \Pi_{ijk\cdots}(x_{ijk\cdots}) \tag{2.13} \]

which is a sum of $\Pi_{ijk\cdots}(x_{ijk\cdots})$ over all the states $x_{ijk\cdots}$ of the leaf 
nodes beneath node $ijk\cdots$ that are consistent with $y_{ijk\cdots}$ emerging at node 
$ijk\cdots$. 


Figure 1: Notation used in the hierarchical maximum entropy derivation. 

The proof of the general hierarchical maximum entropy result proceeds inductively. Firstly, we 
generalise Equation 2.4 to become 

\[ \begin{aligned}
y_i(x) &\longrightarrow \delta(y_{ijk\cdots1} - y_{ijk\cdots1}(x_{ijk\cdots1}))\, \delta(y_{ijk\cdots2} - y_{ijk\cdots2}(x_{ijk\cdots2})) \\
\lambda_i &\longrightarrow \lambda_{ijk\cdots1,ijk\cdots2}(y_{ijk\cdots1}, y_{ijk\cdots2})
\end{aligned} \tag{2.14} \]

Secondly, we generalise Equation 2.8 to become 

\[ Q_{\mathrm{mem}}(x) = Q_0(x)\, f_{1,2}(y_1(x_1), y_2(x_2))\, \Pi_1(x_1)\, \Pi_2(x_2) \tag{2.15} \]


where we display the Lagrange function $f_{1,2}(y_1, y_2)$ that connects the topmost node-pair 
(i.e. node-pair (1,2)) in the tree, but conceal the other Lagrange functions by using the 
$\Pi_{ijk\cdots}$ notation. 

We may determine the exact form of $f_{1,2}(y_1, y_2)$ independently of the rest of the Lagrange 
functions (which are hidden inside the $\Pi_1(x_1)$ and $\Pi_2(x_2)$ functions) by imposing the 
constraint shown in Equation 2.5 and Equation 2.6 (as applied to node-pair (1,2)) to obtain 

\[ \begin{aligned}
P_{1,2}(y_1, y_2) &= \int dx_1\, dx_2\, \delta(y_1 - y_1(x_1))\, \delta(y_2 - y_2(x_2))\, Q_{\mathrm{mem}}(x) \\
&= f_{1,2}(y_1, y_2)\, Z_1(y_1)\, Z_2(y_2)
\end{aligned} \tag{2.16} \]

which yields 

\[ f_{1,2}(y_1, y_2) = \frac{P_{1,2}(y_1, y_2)}{Z_1(y_1)\, Z_2(y_2)} \tag{2.17} \]

Substituting this result back into Equation 2.15 yields 

\[ Q_{\mathrm{mem}}(x) = P_{1,2}(y_1(x_1), y_2(x_2))\, \frac{\Pi_1(x_1)\, \Pi_2(x_2)}{Z_1(y_1(x_1))\, Z_2(y_2(x_2))} \tag{2.18} \]

which correctly obeys the constraint on the joint PDF $P_{1,2}(y_1, y_2)$ of the topmost pair of 
nodes in the tree. We now marginalise $Q_{\mathrm{mem}}(x)$ in order to concentrate our attention 
on the left-hand main branch of the tree. Thus 

\[ \begin{aligned}
Q_{1,\mathrm{mem}}(x_1) &\equiv \int dx_2\, Q_{\mathrm{mem}}(x) \\
&= \int dx_2\, dy_2\, \delta(y_2 - y_2(x_2))\, Q_{\mathrm{mem}}(x) \\
&= P_1(y_1(x_1))\, \frac{\Pi_1(x_1)}{Z_1(y_1(x_1))}
\end{aligned} \tag{2.19} \]

We now use the recursion property given in Equation 2.12 to extract the Lagrange function 
associated with node-pair (11,12). Thus $Q_{1,\mathrm{mem}}(x_1)$ becomes 

\[ Q_{1,\mathrm{mem}}(x_1) = P_1(y_1(x_1))\, f_{11,12}(y_{11}(x_{11}), y_{12}(x_{12}))\, \frac{\Pi_{11}(x_{11})\, \Pi_{12}(x_{12})}{Z_1(y_1(x_1))} \tag{2.20} \]

As before, we may determine the exact form of $f_{11,12}(y_{11}, y_{12})$ independently of the 
rest of the Lagrange functions by applying the constraints to node-pair (11,12) to obtain 

\[ f_{11,12}(y_{11}, y_{12}) = \frac{P_{11,12}(y_{11}, y_{12})\, Z_1(y_1)}{P_1(y_1)\, Z_{11}(y_{11})\, Z_{12}(y_{12})} \tag{2.21} \]

where the value of $y_1$ is to be understood to be obtained directly from the values of $y_{11}$ 
and $y_{12}$ via the mapping which connects node-pair (11,12) to node 1. Substituting this result 
into Equation 2.20 yields 

\[ Q_{1,\mathrm{mem}}(x_1) = P_{11,12}(y_{11}(x_{11}), y_{12}(x_{12}))\, \frac{\Pi_{11}(x_{11})\, \Pi_{12}(x_{12})}{Z_{11}(y_{11}(x_{11}))\, Z_{12}(y_{12}(x_{12}))} \tag{2.22} \]
By inspection, we see that Equation 2.18 and Equation 2.22 are identical in form once we have 
accounted for their different positions in the tree, so we may use induction to obtain all of the 
rest of the Lagrange functions in the form 

\[ f_{ijk\cdots1,ijk\cdots2}(y_{ijk\cdots1}, y_{ijk\cdots2}) = \frac{P_{ijk\cdots1,ijk\cdots2}(y_{ijk\cdots1}, y_{ijk\cdots2})}{P_{ijk\cdots}(y_{ijk\cdots})}\, \frac{Z_{ijk\cdots}(y_{ijk\cdots})}{Z_{ijk\cdots1}(y_{ijk\cdots1})\, Z_{ijk\cdots2}(y_{ijk\cdots2})} \tag{2.23} \]

which is analogous to Equation 2.21, and where $y_{ijk\cdots}$ is obtained directly from the 
values of $y_{ijk\cdots1}$ and $y_{ijk\cdots2}$. The $Z$ factors may be discarded once we reach 
the leaf nodes of the tree, because the integral in Equation 2.13 then reduces to $Z = 1$. 

Finally, by starting with Equation 2.15 and recursively simplifying the $\Pi_{ijk\cdots}$ using 
Equation 2.12 and substituting for the Lagrange functions $f_{ijk\cdots,i'j'k'\cdots}$ using 
Equation 2.23 we obtain eventually for an n-layer tree 

\[ \begin{aligned}
Q_{\mathrm{mem}}(x) = {} & \prod_{k=0}^{n-1}\ \prod_{i_1,i_2,\cdots,i_k=1}^{2} \frac{P_{i_1 i_2 \cdots i_k 1,\, i_1 i_2 \cdots i_k 2}(y_{i_1 i_2 \cdots i_k 1}(x_{i_1 i_2 \cdots i_k 1}),\, y_{i_1 i_2 \cdots i_k 2}(x_{i_1 i_2 \cdots i_k 2}))}{P_{i_1 i_2 \cdots i_k 1}(y_{i_1 i_2 \cdots i_k 1}(x_{i_1 i_2 \cdots i_k 1}))\, P_{i_1 i_2 \cdots i_k 2}(y_{i_1 i_2 \cdots i_k 2}(x_{i_1 i_2 \cdots i_k 2}))} \\
& \times \prod_{i_1,i_2,\cdots,i_n=1}^{2} P_{i_1 i_2 \cdots i_n}(x_{i_1 i_2 \cdots i_n})
\end{aligned} \tag{2.24} \]

where we have rearranged the terms to collect together the factors that each node-pair 
$(i_1 i_2 \cdots i_k 1,\, i_1 i_2 \cdots i_k 2)$ contributes. 

Although we have concentrated on deriving $Q_{\mathrm{mem}}(x)$ for a binary tree, the principle 
of the derivation carries over unchanged to arbitrary tree structures, and Equation 2.24 may 
easily be generalised. In Appendix B we explain the relationship of the single layer version of 
Equation 2.24 to the random access memory network that is known as WISARD [2]. 
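
As a hedged illustration of the hierarchical result, the following Python sketch builds the estimate for a two-layer binary tree with four discrete leaf inputs, using the chain form $P_{1,2}\, (P_{11,12}/P_1)\, (P_{21,22}/P_2)$ that Equation 2.18 and Equation 2.22 give when the leaf nodes pass their inputs through unchanged (so that $Z = 1$ there). The node function `quantise`, the synthetic correlated samples, and all of the names used are assumptions made for illustration; in particular `quantise` is a fixed placeholder and not a trained transformation.

```python
import random
from collections import Counter
from itertools import product

def quantise(a, b):
    """Hypothetical many-to-one node function for a pair of inputs."""
    return max(a, b)

def fit_tree(samples):
    """Measure the marginal probabilities needed by a 2-layer binary tree."""
    counts = {name: Counter() for name in ("left", "right", "top", "y1", "y2")}
    for x11, x12, x21, x22 in samples:
        y1, y2 = quantise(x11, x12), quantise(x21, x22)
        counts["left"][(x11, x12)] += 1    # P_{11,12}
        counts["right"][(x21, x22)] += 1   # P_{21,22}
        counts["top"][(y1, y2)] += 1       # P_{1,2}
        counts["y1"][y1] += 1              # P_1
        counts["y2"][y2] += 1              # P_2
    n = len(samples)
    return {name: {k: v / n for k, v in c.items()} for name, c in counts.items()}

def q_mem(x, p):
    """Two-layer estimate: P_{1,2} * (P_{11,12} / P_1) * (P_{21,22} / P_2)."""
    x11, x12, x21, x22 = x
    y1, y2 = quantise(x11, x12), quantise(x21, x22)
    top = p["top"].get((y1, y2), 0.0)
    if top == 0.0:
        return 0.0   # this pair of node states was never observed
    return (top
            * p["left"].get((x11, x12), 0.0) / p["y1"][y1]
            * p["right"].get((x21, x22), 0.0) / p["y2"][y2])

random.seed(1)
samples = []
for _ in range(2000):
    a = random.randrange(3)
    b = a if random.random() < 0.7 else random.randrange(3)
    c = a if random.random() < 0.5 else random.randrange(3)
    d = c if random.random() < 0.7 else random.randrange(3)
    samples.append((a, b, c, d))

p = fit_tree(samples)
all_states = list(product(range(3), repeat=4))
total = sum(q_mem(x, p) for x in all_states)

# Marginal of the estimate over the left-hand leaf pair.
left_marginal = Counter()
for x in all_states:
    left_marginal[(x[0], x[1])] += q_mem(x, p)
```

In this sketch the estimate is normalised and reproduces the measured marginal of the left-hand leaf pair, even though only low-dimensional marginal tables were ever measured.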


C. Diagrammatic notation 


We now present the steps in the inductive derivation 
leading from Equation 2.18 to Equation 2.22 as a diagram 
in Figure 2. We use a triangle to represent a subtree, 
and we indicate its apex node, its associated $\Pi$ or 
$Z$ factor, and its dependence on $x$. Figure 2a represents 
Equation 2.18, which is a pair of trees connected by the 
joint PDF of their apex nodes. By integrating over x2 
we remove the right hand tree to obtain Figure 2b, which 
corresponds to Equation 2.19. We then explicitly display 
the two daughter nodes to obtain Figure 2c, which corre- 
sponds to Equation 2.20, although we have grouped the 
terms together slightly differently, for simplicity. This 
exposes one of the Lagrange functions which we deter- 
mine explicitly to obtain Figure 2d, which corresponds 
to Equation 2.22. One cycle of the inductive proof is 
completed by noting the correspondence between Figure 
2a and Figure 2d. 


We represent Equation 2.24 in diagrammatic form in Figure 3. The tree structure represents the 
flow of the transformations of the original input data $x$. Each square cornered rectangle 
represents the marginal PDF of the enclosed node-pair (i.e. one $P_{i_1 i_2 \cdots i_n}$ term from 
the second factor in Equation 2.24). Each round cornered rectangle represents the normalised 
marginal PDF of the enclosed node-pair (i.e. one 
$\frac{P_{i_1 i_2 \cdots i_k 1,\, i_1 i_2 \cdots i_k 2}}{P_{i_1 i_2 \cdots i_k 1}\, P_{i_1 i_2 \cdots i_k 2}}$ 
term from the first factor in Equation 2.24). Overall, we obtain Equation 2.24 as the product of 
the rectangles in Figure 3. 


This notation makes it easy to generalise the result 
in Equation 2.24 in a purely diagrammatic fashion, 
by firstly constructing an arbitrary (i.e. not necessar- 
ily binary) tree-like transformation of the input data, 
and secondly using as maximum entropy constraints the 
marginal PDF of each set of sister nodes in the tree. 
This prescription permits many possible ACE structures, 
including those in which different constraints effectively 
operate between different layers of the hierarchy (by map- 
ping one or more node values directly from layer to layer). 


Each rectangle representing a marginal PDF in Figure 
3 contributes to the maximum entropy estimate of the 
PDF of a cluster of nodes in the input data. Because of 
the tree structure, clusters at each length scale are built 
out of clusters at smaller length scales. Equation 2.24 
tells us exactly how to incorporate into Qmem(x) any 
additional statistical properties that might be observed 
when forming larger clusters out of smaller clusters in 
this way. 

Finally, Figure 3 suggests an informal derivation of 
Equation 2.24. Thus the expression for the maximum 
entropy estimate of the joint PDF of the input data x 
in Equation 2.18 can be viewed as the joint PDF of the 
pair of nodes at the top of the tree in Figure 3 times 
corrective Jacobian factors that compensate for effects of 
the many-to-one mapping that the input data undergoes 
before it reaches the top of the tree. The final maximum 
entropy expression in Equation 2.24 merely enumerates 
these corrective Jacobian factors explicitly in terms of 
marginal PDFs measured at various levels of the tree. 
This makes it clear that the maximum entropy method 
gives a result that is consistent with simple counting 
arguments, which could therefore be used in place of the 
rather involved maximum entropy derivation. 

Figure 2: The individual steps of the inductive hierarchical maximum entropy derivation. 

Figure 3: A diagrammatic representation of the hierarchical maximum entropy result. 


III. IMPLEMENTATION OF AN ANOMALY 
DETECTOR 


Henceforth we shall refer to our hierarchical maximum 
entropy method as an adaptive cluster expansion (ACE). 
In this section we describe how to implement Equation 
2.24 in software. We assume that the ACE transforma- 
tion functions have already been optimised using the un- 
supervised network training algorithm that we describe 
in Appendix A and in [19], so the purpose of this sec- 
tion is to explain how to manipulate Equation 2.24 into 
a form that produces a useful output from the network. 
For concreteness, we produce an output in the form of 
an image that represents the degree to which each local 
patch of the input data is statistically anomalous, when 
compared to the global statistical properties of the input 
data. 


A. Two-dimensional array of inputs 


In Section II we represented ACE as if it were operating on a 1-dimensional array of inputs (e.g. 
time series). In practice this might indeed be the case, but in this paper we choose to study 
2-dimensional arrays of inputs (e.g. images). There is no difficulty in applying ACE to an image, 
provided that we appropriately assign the leaf nodes to pixels of the image. In Figure 4 we show 
the simplest possibility in which the image is alternately compressed in the north-south and 
east-west directions. A priori, the choice of whether to start with north-south or east-west 
compression is arbitrary, but if we knew, for instance, that the image had stronger short range 
correlations in the east-west direction than the north-south direction, then it would be better to 
compress east-west first of all. Note that in Figure 4 the topology of the tree is the same as in 
Figure 3, but the way in which the leaf nodes are identified with the data samples is different. 

Figure 4: ACE connectivity for processing a 2-dimensional array of inputs. 
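
The alternating north-south and east-west compression can be sketched in Python as follows. This is only an illustration under stated assumptions: the image dimensions are taken to be powers of two, the same placeholder pair function `f` is applied at every node (a trained ACE would use learned transformation functions instead), and the names `compress_ns`, `compress_ew` and `pyramid` are hypothetical.

```python
def compress_ns(grid, f):
    """Merge vertically adjacent pairs of rows (north-south compression)."""
    return [[f(top, bottom) for top, bottom in zip(grid[r], grid[r + 1])]
            for r in range(0, len(grid), 2)]

def compress_ew(grid, f):
    """Merge horizontally adjacent pairs of columns (east-west compression)."""
    return [[f(row[c], row[c + 1]) for c in range(0, len(row), 2)]
            for row in grid]

def pyramid(grid, f, start="ns"):
    """Return every layer of the tree, from the pixels up to the apex node.

    The compression direction alternates between layers; a direction is
    skipped once the grid is only one cell wide in that direction."""
    layers = [grid]
    direction = start
    while len(layers[-1]) > 1 or len(layers[-1][0]) > 1:
        g = layers[-1]
        if direction == "ns" and len(g) > 1:
            layers.append(compress_ns(g, f))
        elif direction == "ew" and len(g[0]) > 1:
            layers.append(compress_ew(g, f))
        direction = "ew" if direction == "ns" else "ns"
    return layers

# Hypothetical 4 x 4 image; addition stands in for the node function.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
layers = pyramid(image, lambda a, b: a + b, start="ns")
shapes = [(len(g), len(g[0])) for g in layers]
```

Because each pixel enters exactly one pair at each layer, the connectivity remains a tree; passing `start="ew"` gives the other ordering discussed above.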


More generally, we could identify the leaf nodes of the 
tree with the image pixels in any way that we please, pro- 
vided that no pixel is used more than once (to guarantee 
that the tree-like topology is preserved). The problem of 
optimising the identification of leaf nodes with pixels is 
extremely complicated, so we shall not pursue it in this 
paper. 


B. Histograms 


The maximum entropy PDF in Equation 2.24 is a product of (normalised) marginal PDFs. In a 
practical implementation of ACE the $y_{ijk\cdots}$ are discrete-valued quantities (for instance, 
integers in the interval [0, 255]), and the 
$P_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots})$ are probabilities (not PDFs). We 
estimate the $P_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots})$ by constructing 
2-dimensional histograms 

\[ P_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots}) \approx \frac{1}{N_{ijk\cdots,i'j'k'\cdots}}\, h_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots}) \tag{3.1} \]

where $h_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots})$ is the number of counts in the 
histogram bin $(y_{ijk\cdots}, y_{i'j'k'\cdots})$, and $N_{ijk\cdots,i'j'k'\cdots}$ is the total 
number of histogram counts given by 

\[ N_{ijk\cdots,i'j'k'\cdots} = \sum_{y_{ijk\cdots},\, y_{i'j'k'\cdots}} h_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots}) \tag{3.2} \]

Note that the estimate in Equation 3.1 suffers from Poisson noise due to the finite number of 
counts in each histogram bin. 

In order to build up this estimate we first of all train the ACE transformation functions as 
explained in Appendix A. The histogram bins are then initialised to zero, and subsequently filled 
with counts by exposing the trained ACE to many examples of input vectors (possibly, the set used 
to train the transformation functions). Thus each vector is propagated up through the ACE-tree, 
and we then inspect each node-pair $(ijk\cdots, i'j'k'\cdots)$ for which a marginal probability 
needs to be estimated, and increment its corresponding histogram bin thus 

\[ h_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots}) \longrightarrow h_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots}) + 1 \tag{3.3} \]

When the training set has been exhausted, histogram bin $(ijk\cdots, i'j'k'\cdots)$ records the 
number of times that state $(y_{ijk\cdots}, y_{i'j'k'\cdots})$ occurred. 


A major disadvantage of using histograms is that they 
have a large number of adjustable parameters (i.e. the 
number of counts in each bin) that have to be determined 
by the training data, so they do not generalise very well. 
However, for the purpose of this paper, we do not need 
to resort to using more sophisticated ways of estimating 
PDFs. 
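
The histogram bookkeeping of Equations 3.1 to 3.3 can be sketched for a single node-pair as follows. This is a minimal illustration, assuming that the node states are small integers that can be used directly as bin labels; the class name `PairHistogram` and the example states are hypothetical.

```python
from collections import Counter

class PairHistogram:
    """2-dimensional histogram of co-occurring states for one node-pair."""

    def __init__(self):
        self.counts = Counter()   # every bin implicitly starts at zero
        self.total = 0            # total number of counts (Equation 3.2)

    def update(self, y_a, y_b):
        """Increment the bin for the observed pair of states (Equation 3.3)."""
        self.counts[(y_a, y_b)] += 1
        self.total += 1

    def probability(self, y_a, y_b):
        """Estimate the joint probability of the pair (Equation 3.1)."""
        if self.total == 0:
            raise ValueError("histogram is empty")
        return self.counts[(y_a, y_b)] / self.total

# Hypothetical node states observed while propagating four training vectors.
hist = PairHistogram()
for y_a, y_b in [(3, 7), (3, 7), (0, 255), (3, 8)]:
    hist.update(y_a, y_b)

p_seen = hist.probability(3, 7)     # two of the four observations
p_unseen = hist.probability(1, 1)   # a bin that never received a count
```

A bin that never receives a count yields a probability of exactly zero, which is one concrete form of the poor generalisation noted above.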


C. Translation invariant processing 


We wish to detect statistical anomalies in images which 
have otherwise spatially homogeneous statistics, such as 
textures. An invariance of the statistical properties of 
the true PDF $P(x)$ can be expressed as


P(Gx) = P(x) (3.4) 


where G is any element of the invariance group, which we
shall assume to be the group of translations of the image
pixels. In Equation 2.24 $Q_{\mathrm{mem}}(x)$ does not respect
translation invariance for two reasons. Firstly, we use
transformations $y_{ijk\cdots}(x_{ijk\cdots})$ that are explicitly
translation variant, because the functional form depends on the
$ijk\cdots$ indices. Secondly, we connect together these
transformations in a translation variant way, because the tree
structure in Figure 1 and Figure 4 does not treat all of
its leaf nodes equivalently. We shall therefore modify the
cluster expansion procedure that we derived in Section
II B to guarantee translation invariance. This will lead to
a much improved maximum entropy estimate $Q_{\mathrm{mem}}(x)$
of the true $P(x)$.


Firstly, use the same transformation function at each 
position within a single layer of ACE. Thus in Equation 
2.24 we make the replacement 


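$$y_{i_1 i_2 \cdots i_k 1}(x_{i_1 i_2 \cdots i_k 1}) \longrightarrow y^k(x_{i_1 i_2 \cdots i_k 1})$$
$$y_{i_1 i_2 \cdots i_k 2}(x_{i_1 i_2 \cdots i_k 2}) \longrightarrow y^k(x_{i_1 i_2 \cdots i_k 2}) \qquad (3.5)$$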


where we indicate that the transformation is associated 
with the k-th layer of ACE by attaching a superscript k 
to each function. This yields 


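$$Q_{\mathrm{mem}}(x) = \prod_{k=0}^{n-1} \prod_{i_1,i_2,\cdots,i_k=1}^{2} \frac{P_{i_1 i_2 \cdots i_k 1,\, i_1 i_2 \cdots i_k 2}\left(y^k(x_{i_1 i_2 \cdots i_k 1}),\, y^k(x_{i_1 i_2 \cdots i_k 2})\right)}{P_{i_1 i_2 \cdots i_k 1}\left(y^k(x_{i_1 i_2 \cdots i_k 1})\right) P_{i_1 i_2 \cdots i_k 2}\left(y^k(x_{i_1 i_2 \cdots i_k 2})\right)} \times \prod_{i_1,i_2,\cdots,i_n=1}^{2} P_{i_1 i_2 \cdots i_n}(x_{i_1 i_2 \cdots i_n}) \qquad (3.6)$$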


Equation 3.6 guarantees translation invariance (in the sense of a “single-instruction-multiple-data” computer) of the 
processing that occurs when the input data is propagated upwards through the overlapping trees. 
Secondly, assume that Equation 3.4 holds for all image translations, so that the marginal PDFs are independent of 
position. We may make this explicit in our notation by making the following replacement in Equation 3.6 


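$$P_{i_1 i_2 \cdots i_k 1,\, i_1 i_2 \cdots i_k 2}(\cdot,\cdot) \longrightarrow P^k_{1,2}(\cdot,\cdot)$$
$$P_{i_1 i_2 \cdots i_k 1}(\cdot)\, P_{i_1 i_2 \cdots i_k 2}(\cdot) \longrightarrow P^k_1(\cdot)\, P^k_2(\cdot)$$
$$P_{i_1 i_2 \cdots i_n}(\cdot) \longrightarrow P^n(\cdot) \qquad (3.7)$$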


where we use the same superscript notation as in Equation 3.5. This yields 


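$$Q_{\mathrm{mem}}(x) = \prod_{k=0}^{n-1} \prod_{i_1,i_2,\cdots,i_k=1}^{2} \frac{P^k_{1,2}\left(y^k(x_{i_1 i_2 \cdots i_k 1}),\, y^k(x_{i_1 i_2 \cdots i_k 2})\right)}{P^k_1\left(y^k(x_{i_1 i_2 \cdots i_k 1})\right) P^k_2\left(y^k(x_{i_1 i_2 \cdots i_k 2})\right)} \times \prod_{i_1,i_2,\cdots,i_n=1}^{2} P^n(x_{i_1 i_2 \cdots i_n}) \qquad (3.8)$$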


Equation 3.8 guarantees not only translation invariance of the transformations that propagate the data through the 
tree, but also translation invariance of the marginal PDFs of $P(x)$ that are used to construct $Q_{\mathrm{mem}}(x)$. 

Both of the simplifications in Equation 3.5 and Equation 3.7 reduce the total number of unknowns that have to be 
determined. For a given amount of training data we can thus construct a better maximum entropy estimate $Q_{\mathrm{mem}}(x)$ 
of the true $P(x)$. The transformation functions may be optimised better, and the histogram bins have a reduced 
Poisson noise. 

We usually apply ACE to such large input arrays that it is not appropriate to build a single binary tree whose leaf 
nodes encompass the entire input array. Instead, we divide the input array (which we shall assume is a $2^M \times 2^M$ 
array of image pixels) into a set of contiguous $2^{m_1} \times 2^{m_2}$ arrays, each of which we analyse using Equation 3.8. There 
are no constraint functions to measure the mutual dependencies between these subarrays, so the maximum entropy 
joint PDF of the set of subarrays is a product of terms of the form shown in Equation 3.8. 


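$$\log(Q_{\mathrm{mem}}(x)) = \sum_{a_1=1}^{2^{M-m_1}} \sum_{a_2=1}^{2^{M-m_2}} \sum_{k=0}^{n-1} \sum_{i_1,i_2,\cdots,i_k=1}^{2} \log\left(\frac{P^k_{1,2}\left(y^k(x^{a_1,a_2}_{i_1 i_2 \cdots i_k 1}),\, y^k(x^{a_1,a_2}_{i_1 i_2 \cdots i_k 2})\right)}{P^k_1\left(y^k(x^{a_1,a_2}_{i_1 i_2 \cdots i_k 1})\right) P^k_2\left(y^k(x^{a_1,a_2}_{i_1 i_2 \cdots i_k 2})\right)}\right)$$
$$\qquad + \sum_{a_1=1}^{2^{M-m_1}} \sum_{a_2=1}^{2^{M-m_2}} \sum_{i_1,i_2,\cdots,i_n=1}^{2} \log\left(P^n(x^{a_1,a_2}_{i_1 i_2 \cdots i_n})\right) \qquad (3.9)$$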


The summation over $(a_1, a_2)$ ranges over the $2^{2M-m_1-m_2}$ contiguous subarrays in the overall $2^M \times 2^M$ array, and the 
$a_1, a_2$ superscript on each $x_{i_1 i_2 \cdots}$ vector indicates that it belongs to subarray $(a_1, a_2)$. Note that we have transformed 
$Q_{\mathrm{mem}}(x) \longrightarrow \log(Q_{\mathrm{mem}}(x))$ for convenience. 

The final step in constructing a fully translation invariant PDF is to modify the sum over subarrays so that it 
includes all possible placements of the $2^{m_1} \times 2^{m_2}$ subarray within the overall $2^M \times 2^M$ array. There are $2^{2M-m_1-m_2}$ 
possible positions when the placement of the subarray is restricted as in Equation 3.9, whereas there are 
$(2^M - 2^{m_1} + 1)(2^M - 2^{m_2} + 1)$ possible positions when all placements of the subarray are permitted. We therefore make the 
replacement 


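$$\sum_{a_1=1}^{2^{M-m_1}} \sum_{a_2=1}^{2^{M-m_2}} (\cdots) \longrightarrow \frac{2^{2M-m_1-m_2}}{(2^M - 2^{m_1} + 1)(2^M - 2^{m_2} + 1)} \sum_{p_1=1}^{2^M - 2^{m_1} + 1} \sum_{p_2=1}^{2^M - 2^{m_2} + 1} (\cdots)$$
$$\qquad \simeq \frac{1}{2^{m_1+m_2}} \sum_{p_1=1}^{2^M} \sum_{p_2=1}^{2^M} (\cdots) \qquad (3.10)$$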


in Equation 3.9, where $(p_1, p_2)$ is the coordinate of the pixel in the top left hand corner of the $2^{m_1} \times 2^{m_2}$ subarray. 
If we ignore edge effects, then we may use the approximation in the final line of Equation 3.10, which is the average 
of $2^{m_1+m_2}$ separate contributions of the form shown in Equation 3.9. Equation 3.10 effectively replaces the original 
maximum entropy PDF $Q_{\mathrm{mem}}(x)$ by the geometric mean of a set of maximum entropy PDFs. This averaging reduces 
the problems caused by Poisson noise on the histogram bin contents to yield a greatly improved maximum entropy 
PDF estimate. 
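
The placement counts that appear in the text above can be checked numerically. The following is a minimal sketch: the function names are hypothetical, and the values of $M$, $m_1$ and $m_2$ in the example are illustrative choices rather than values taken from this paper.

```python
def restricted_placements(M, m1, m2):
    """Number of contiguous, non-overlapping subarray positions."""
    return 2 ** (2 * M - m1 - m2)


def all_placements(M, m1, m2):
    """Number of positions when every placement inside the array is allowed."""
    return (2 ** M - 2 ** m1 + 1) * (2 ** M - 2 ** m2 + 1)


def averaging_weight(M, m1, m2):
    """Prefactor that keeps the total weight unchanged when all placements are summed."""
    return restricted_placements(M, m1, m2) / all_placements(M, m1, m2)


# Illustrative values only: a 256 x 256 array with 16 x 16 subarrays.
n_restricted = restricted_placements(8, 4, 4)
n_all = all_placements(8, 4, 4)
weight = averaging_weight(8, 4, 4)
```

For these values the exact prefactor is close to, but slightly larger than, $2^{-(m_1+m_2)}$; the difference is the edge effect that the approximation ignores.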
Figure 5: Connectivity for multiple overlapping binary trees. 


In practice, we would implement each layer of ACE as a frame store, and the transformation between each pair 
of adjacent layers as a look-up table. The translation invariant ACE that we derived in Equation 3.9 (with the 
replacement given in Equation 3.10) may be implemented using the connectivity shown in Figure 5. Ignoring edge 
effects, we may write Equation 3.9 symbolically as 


n—2 1 pk 1 
k=0 


where the inner summations range over all positions within a single layer of Figure 5. We omit all of the functional 
dependencies, because they are easy to obtain from Figure 3. Each $\frac{P^k_{1,2}}{P^k_1 P^k_2}$ term is represented by a rectangle with 
rounded corners in Figure 3, and each $P^{n-1}$ term is represented by a rectangle with square corners in Figure 3. We 
have not drawn these rectangles in Figure 5 because they would overlap, and thus confuse the diagram. 


D. Forming a probability image 


Equation 3.11 is the fundamental result that we use to 
construct useful image processing schemes. However, it 
would not be very useful simply to calculate the value of 
log(Qmem) as a single global measure of the logarithmic 
probability associated with an image. We choose instead 
to break up Equation 3.11 into smaller pieces, and to 
examine their contribution to the overall log(Qmem). In 
effect, we look at how log(Qmem) is built up from the 
information in each layer of ACE, which in turn we break 
down into contributions from different areas of the image. 


Figure 6: Backpropagation scheme for constructing a proba- 
bility image. 


In order to ensure that our decomposition of 
log(Qmem) can be easily computed, we use the backpropagation 
scheme shown in Figure 6 to control the data 
flow through a translation invariant network of an identical 
connectivity to the one shown in Figure 5. Each 
node of this backpropagation network records a logarithmic 
probability, and is cleared to zero before starting the 
backpropagation computations. The rectangles in Figure 6 
represent exactly the same logarithmic probability 
terms that appeared in Figure 3, which we now use as 
sources of logarithmic probability that we inject into the 
backpropagating data flow. 


The detailed operation of Figure 6 is as follows. 
Each addition symbol takes as input a contribution 
recorded at a node in the next layer above, adds its own 
logarithmic probability source $\log\left(\frac{P^k_{1,2}}{P^k_1 P^k_2}\right)$, scales the result 
by $\frac{1}{2}$, and finally adds a copy of this result to the 
value stored at each of its own pair of associated nodes, as 
shown. The values that accumulate at the leaf nodes represent 
various contributions to the sum in Equation 3.11. 
If the translation invariant version of Figure 6 is applied 
to the translation invariant network shown in Figure 5, 
then the sum of the values that accumulate at the leaf 
nodes reproduces Equation 3.11 precisely. 
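
The add, scale and copy steps described above can be illustrated with a short sketch. It makes two simplifying assumptions that are not taken from the figures: it uses a single non-overlapping binary tree rather than the overlapping network of Figure 5, and it assumes a scaling factor of 1/2. The function name `backpropagate` and the list-of-layers input format are hypothetical.

```python
def backpropagate(sources, scale=0.5):
    """Push logarithmic probability sources down a binary tree.

    sources[k] lists the sources injected at layer k (root first), so
    layer k must hold 2**k entries. Returns the values at the leaves.
    Assumes a non-overlapping tree and scale = 1/2 (illustrative only).
    """
    values = [0.0]                        # root node, cleared to zero
    for layer in sources:
        if len(layer) != len(values):
            raise ValueError("layer k must contain 2**k sources")
        children = [0.0] * (2 * len(values))
        for j, (value, source) in enumerate(zip(values, layer)):
            result = scale * (value + source)   # add own source, then scale
            children[2 * j] += result           # copy to first child node
            children[2 * j + 1] += result       # copy to second child node
        values = children
    return values


leaves = backpropagate([[-1.0], [-2.0, -4.0]])
```

In this simplified setting the leaf values sum to the total of all injected sources, which mirrors the statement that the accumulated leaf values reproduce the overall sum.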


This method of computing log(Qmem) might seem to 
be circuitous, but it has the great advantage of both being 
computationally cheap and forming an image-like 
representation of log(Qmem), which we call a “probability image”. 
Each $\log\left(\frac{P^k_{1,2}}{P^k_1 P^k_2}\right)$ term in Equation 3.11 will 
contribute equally to $2^{n-k}$ pixels in the probability image. 
These pixels will be arranged as either a square or a 
2-to-1 aspect ratio rectangle according to whether there is 
an even number or an odd number of backpropagation steps 
from the k-th layer to the leaf nodes. The probability image 
is therefore a superposition of square and rectangular 
tiles of logarithmic probability. Each tile corresponds to 
a node of the network shown in Figure 5. 

It is useful to display as an image the contributions 
of a single layer of the network to the probability im- 
age, because different layers contribute to the structure 
of log(Qmem) at different length scales. This image may 
be displayed in the conventional way, with small proba- 
bilities mapped to black, large probabilities mapped to 
white, and intervening probabilities mapped to shades of 
grey, in which case we call it a “probability image”. It is 
also useful to invert the grey scale so that small probabilities 
map to white, in which case we call it an “anomaly 
image”, because regions which have statistical properties 
that occur infrequently show up as bright peaks in the 
image. We find that the use of probability images and/or 
anomaly images is an extremely effective way of visually 
interpreting log(Qmem) in Equation 3.11. 


E. Modular implementation 


For completeness we now present a brief description 
of a complete system for producing probability and/or 
anomaly images. This system consists of two tightly cou- 
pled subsystems — an ACE subsystem for decomposing 
the image data, and a probability image subsystem for 
forming the output image. Figure 7 combines in one dia- 
gram all of the results that we have discussed so far. The 
upper part of Figure 7 is a pure translation invariant 
ACE subsystem, whereas the lower half is a backpropa- 
gating probability image subsystem operating as shown 
in Figure 6. The backpropagating subsystem takes input 
information from various layers of ACE, as shown. Mod- 
ules “IT” are framestores that record the various trans- 
formed images. Modules “M” are look-up tables that 
record the inter-layer mappings. Modules “T” represent 
the training algorithm that we explain in Appendix A, 
which we enclose in a dashed box because the “T” mod- 
ules are switched out of the circuit once the mappings “M” 
have been determined. Modules “H” are accumulators 
that record the 2-dimensional histograms, and then reg- 
ularise and normalise them appropriately. Modules “P” 
are framestores that record the various backpropagated 
probability images. Modules “log” are look-up tables (in 
fact only one such table is needed) that implement a logarithm 
function. Modules “⊕” and “⊗” perform the addition 
and scaling operations that we discussed earlier in 
connection with Figure 6. “N” is a scaling factor (which is 
$\frac{1}{2}$ if we wish to reproduce the result in Equation 3.11). 
The lines that are annotated “G” represent a ganging to- 
gether of the (pointers to) pixels in adjacent layers of the 
ACE subsystem and in the probability image subsystem. 
These ensure that the entire system works in lockstep, as 
required. 

The simplest mode of operation of this system can be 
broken down into three stages. Firstly, train each layer 
(from left to right) of the ACE subsystem on a training 
image. Secondly, propagate a test image (from left 
to right) through the layers of ACE. Finally, construct 
a probability image by backpropagating (from right to 
left) contributions from the various layers of ACE. Fur- 
thermore, it is useful to display separately the probability 
(or anomaly) images that emerge from each layer of ACE, 
as we shall see in Section IV. 

There is a variety of methods of optimising “T”, and 
hence “M”. The method that we describe in Appendix 
A trains each layer in sequence, which takes 2.3 seconds 
per layer (using a VAXstation 3100, and assuming 
6 bits per pixel), which gives a full training time of 20 
seconds for the 8 layer network that we use in our nu- 
merical simulations. We do not make use of more so- 
phisticated schemes in which different layers are simulta- 
neously trained, whilst communicating information with 
each other to improve the global performance of ACE. 


F. Relationship to co-occurrence matrix methods 


Both the basic maximum entropy PDF $Q_{\mathrm{mem}}(x)$ in 
Equation 2.24, and the translation invariant version of 
$\log(Q_{\mathrm{mem}}(x))$ in Equation 3.11 that we implement in 
practice, depend on various PDFs that are measured in 
an ACE-tree. The second term of Equation 3.11 may be 
written as 

$$Q_{\mathrm{mem}}(x) = \left[\prod P^{n-1}\right]^{\frac{1}{2}} \qquad (3.12)$$

Each $P^{n-1}$ factor is the spatial average of the marginal 
PDF of pairs of adjacent pixel values, assuming that we 
use the identification of leaf nodes with pixels that we 
show in Figure 4. The square root in Equation 3.12 
compensates for the fact that the product of $P^{n-1}$ factors 
generates the product of two maximum entropy PDFs 
shifted by one pixel relative to each other. 

By using Equation 3.1 we may approximate Equation 
3.12 as a product of histograms. In this case each 
histogram is the spatial average of the co-occurrence matrix 
of pairs of adjacent pixel values, as commonly used in 
image processing [5]. Thus we may use conventional 
co-occurrence matrix methods to construct a simple form of 
maximum entropy PDF, which corresponds to using only 
one layer of ACE. 

This co-occurrence matrix result can be generalised, 
using Equation 2.24 or Equation 3.11, to model higher 
order statistical behaviour. Although these results depend 
on co-occurrence matrices measured at various places in 
the ACE-tree, the contributions which do not depend 
directly on the input data (i.e. the first term of Equation 3.11) 
actually model higher order statistics of the 
input data. This is because the value $y_{ijk\cdots}$ that emerges 
from node $ijk\cdots$ of the ACE-tree depends on $x_{ijk\cdots}$, so 
the joint PDF $P_{ijk\cdots 1,\, ijk\cdots 2}(y_{ijk\cdots 1}, y_{ijk\cdots 2})$ depends on the 
statistics of the pair $(x_{ijk\cdots 1}, x_{ijk\cdots 2})$. Thus ACE is a 
very convenient way of combining together the various 
orders of statistical information that are contained in 
co-occurrence matrices at various places in the ACE-tree, as 
shown in Figure 3. 

Figure 7: Three layer translation invariant ACE system. 
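
For readers unfamiliar with co-occurrence matrices, the single-layer case discussed in this section can be sketched as follows. This is an illustrative example only: the function name is hypothetical, the image is a small list of rows of integer grey levels, and the choice of horizontally adjacent pairs is an assumption made for the example.

```python
def cooccurrence_matrix(image, levels):
    """Count pairs of horizontally adjacent pixel values over a whole image.

    counts[a][b] is the number of times grey level a has grey level b
    as its right-hand neighbour (horizontal adjacency is assumed here).
    """
    counts = [[0] * levels for _ in range(levels)]
    for row in image:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
    return counts


# A tiny two-level image, used purely for illustration.
image = [
    [0, 0, 1],
    [1, 1, 0],
]
counts = cooccurrence_matrix(image, levels=2)
```

Dividing each entry by the total number of pairs gives a spatially averaged estimate of the pair probability in the sense of Equation 3.1.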


IV. NUMERICAL RESULTS 


In this section we explain the finer details of how to 
implement Figure 7 in software, and we present the re- 
sults of applying the system to four 256 x 256 images of 
textiles taken from the Brodatz texture set [3]. 


A. Experimental procedure 


We compensated for some of the effects of non-uniform 
illumination by adding to each image a grey scale wedge 
whose gradient was chosen in such a way as to remove 
the linear component of the non-uniformity. Not only 
does this improve the translation invariance of the image 
statistics, but it also improves the quality of the hierar- 
chical coding of the image, because we reduce the need to 
develop redundant codes which differ only in their overall 
grey level. 

Throughout our experiments we generate optimal 
inter-layer mappings using the training methods that 
we explain in Appendix A. These are known as topo- 
graphic mappings in the neural network literature, and 
we showed in [18] why they are appropriate for building 
multistage vector quantisers. We choose to compress 
the image in alternate directions using the following 
sequence: north/south, east/west, north/south, east/west, 
etc. This compression sequence leads to the following 
sequence of rectangular image regions that influence the 
state of each pixel in each stage of ACE: 1 x 2, 2 x 2, 2 x 4, 
4 x 4, etc., using (east/west, north/south) coordinates. In 
all of our experiments we use an 8 stage ACE. 
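
The sequence of region sizes follows directly from the alternating compression directions, as the short sketch below shows. The function name `receptive_fields` is hypothetical and the code is an illustration, not part of the system described here.

```python
def receptive_fields(stages):
    """Return the (east/west, north/south) region size after each stage.

    Stages alternate north/south, east/west, north/south, ... and each
    stage doubles the extent of the region in its own direction.
    """
    width, height = 1, 1
    sizes = []
    for stage in range(stages):
        if stage % 2 == 0:
            height *= 2      # north/south compression
        else:
            width *= 2       # east/west compression
        sizes.append((width, height))
    return sizes


sizes = receptive_fields(8)
```

For an 8 stage ACE this gives regions that grow from 1 x 2 up to 16 x 16 pixels, with the sixth stage corresponding to 8 x 8 pixels.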


The number of bits per pixel that we use in each layer 
of ACE determines the quality of the hierarchical vec- 
tor quantisation that emerges. Increasing the number of 
bits improves the quality of the vector quantisation but 
increases the training time: we need to compromise be- 
tween these two conflicting requirements. In our work on 
simple Brodatz texture images we have found that 6-8 
bits per pixel is sufficient. 


It is important to note that for a given number of bits 
(after compression) there is an upper limit on the allowed 
entropy that the input data can have. This problem be- 
comes more severe the greater the data compression fac- 
tor (i.e. the further we progress through the layers of 
ACE). For instance, if the input image is very noisy then 
6-8 bits will be sufficient only to give good vector quan- 
tisation performance in the first few layers of ACE. This 
problem arises because ACE does not have much prior 
knowledge of the statistical properties of the input data, 
so each node of ACE encodes its input without assuming 
a prior model. A prior model would allow us to reduce 
the bit rate. This is a fundamental limitation to the ca- 
pabilities of the current version of ACE. 


The choice of the size of the 2-dimensional histogram 
bins is also important. A property of the topographic 
mappings that we use to connect the layers of ACE 
is that adjacent histogram bins derive from input vectors 
that are close to each other (in the Euclidean sense), so it 
is sensible to rebin the histogram by combining together 
adjacent bins. Thus we control the histogram bin size by 
truncating the low order bits of each binary vector that 
represents a pixel value. If we do not truncate any bits, 
then the 2-dimensional histogram faithfully records the 
number of times that a pair of pixel values has occurred. 
However, if we truncate b low order bits of each pixel 
value then effectively we sum together the histogram bins 
in groups of $2^{2b}$ (= $2^b \times 2^b$) adjacent bins, which smooths 
the histogram. The more smoothing that we impose the 
less Poisson noise the histogram suffers. However, as we 
smooth the histogram we run the danger of smoothing 
away significant structure that might usefully be used 
to characterise the input image: so we need to make a 
compromise. In our Brodatz texture work we use only 
4-6 bits of each pixel value to generate the histograms 
in each stage of ACE. Note that we use more bits for 
vector quantisation than for histogramming because the 
vector quantisation needs to be good enough to preserve 
information for encoding by later layers of the hierarchy, 
whereas the histogramming information is not passed to 
later layers. 

In Equation 3.11 we need to estimate the logarithm 
of various probabilities from the histograms. We do this 
in three stages. Firstly, we regularise the histograms by 
placing a lower bound on the permitted number of counts. 
One possible prescription is to ensure that each histogram 
bin has a number of counts at least as large as the average 
number of counts in all the histogram bins (as determined 
before regularising the histogram). Thus 

$$h_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots}) \longrightarrow \begin{cases} h_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots}) & h_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots}) \geq \langle h \rangle \\ \langle h \rangle & h_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots}) < \langle h \rangle \end{cases} \qquad (4.1)$$

where the angle brackets $\langle \cdots \rangle$ denote an average over 
histogram bins, rounded up to the next largest integer to 
avoid setting histogram bins to zero. Secondly, we estimate 
the probabilities $P_{ijk\cdots,i'j'k'\cdots}(y_{ijk\cdots}, y_{i'j'k'\cdots})$ by 
inserting the regularised histograms into Equation 3.1. We 
use a marginalised version of Equation 3.1 to estimate the 
marginal probabilities $P_{ijk\cdots}(y_{ijk\cdots})$. Finally, we compute 
the logarithmic probabilities in Equation 3.11 by using a 
table of logarithms of integers, up to the maximum possible 
number of counts that could occur in a histogram 
bin; it suffices to tabulate logarithms up to log(N). 

The prescription in Equation 4.1 is crude but effective. 
We could improve the performance by introducing prior 
knowledge of the statistical properties of the input data. 
Our histogram smoothing prescription already implicitly 
makes use of prior knowledge of the properties of the 
Poisson noise process that affects the histogram counts, 
and prior knowledge of the fact that adjacent histogram 
bins correspond to similar input vectors. Additional prior 
knowledge would further enhance the performance, especially 
in cases where there is a limited amount of training 
data (such as small images, or small segments of larger 
images). 

A pitfall that must be avoided is using histogram bins 
that are too small when one intends to train ACE on 
one image and then use a different image to generate a 
probability image. Effectively, the large number of small 
bins records the details of the statistical fluctuations of 
the training image (as particular realisations of a Poisson 
noise process in each bin), which thus acts as a detailed 
record of the structure in the training image. The histograms 
thus look very spiky, and in an extreme case 
there may be counts recorded in only a few bins with 
zeros in all of the remaining bins. If this situation occurs 
then the training image records a large $\log(Q_{\mathrm{mem}}(x))$, 
whereas a test image having the same statistical properties 
records a small $\log(Q_{\mathrm{mem}}(x))$. Effectively, the spikes 
in the training and test image histograms are not coincident. 
This problem can be solved by choosing a large 
enough histogram bin size. 
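
The rebinning by bit truncation and the lower bound on the counts can be sketched together in Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, pixel values are small non-negative integers, and "rounded up" is implemented with `math.ceil`, which leaves an exactly integral average unchanged.

```python
import math
from collections import Counter


def rebinned_histogram(pairs, b):
    """Count pairs of pixel values after dropping b low order bits of each."""
    hist = Counter()
    for y, y_prime in pairs:
        hist[(y >> b, y_prime >> b)] += 1
    return hist


def regularise(hist, bits_kept):
    """Raise every bin to at least the average count, rounded up."""
    side = 2 ** bits_kept                       # bins along each axis
    floor_count = math.ceil(sum(hist.values()) / (side * side))
    return {
        (y, y_prime): max(hist.get((y, y_prime), 0), floor_count)
        for y in range(side)
        for y_prime in range(side)
    }


# Illustrative 2-bit pixel values, truncated to 1 bit (b = 1).
pairs = [(0, 1), (1, 0), (0, 0), (3, 2), (2, 3)]
hist = rebinned_histogram(pairs, b=1)
smoothed = regularise(hist, bits_kept=1)
```

In this toy example the five counts fall into two of the four coarse bins, the average count rounds up to 2, and the two empty bins are therefore raised to 2.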

Finally, we display the logarithmic probability image 
as follows. We determine the range of pixel values that 
occurs in the image, and we translate and scale this into 
the range [0,255]. This ensures that the smallest loga- 
rithmic probability appears as black, and the largest loga- 
rithmic probability appears as white, and all other values 
are linearly scaled onto intermediate levels of grey. This 
prescription has its dangers because each probability im- 
age determines its own special scaling, so one should be 
careful when comparing two different probability images. 
It can also be adversely affected by pixel value outliers 
arising from Poisson noise effects, where an extreme value 
of a single pixel could affect the way in which the whole 
of an image is displayed. However, we find that the over- 
lapping tree prescription in Figure 5 together with the 
backpropagation prescription in Figure 6, causes enough 
effective averaging together of the histogram bins that we 
do not encounter problems with pixel value outliers. 

In all of the images that we present below, we compen- 
sate for the uneven illumination by introducing a grey 
scale wedge as we explained earlier, we use 8 bits per 
pixel for vector quantisation, we use 6 bits per pixel for 
histogramming, and we invert the [0,255] scale to pro- 
duce an anomaly image, in which a white pixel indicates 
a small (rather than a large) logarithmic probability. 
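
The display prescription described above amounts to a linear rescaling followed by an optional inversion. The sketch below is illustrative only: the function name is hypothetical, the image is flattened to a list of logarithmic probabilities, and rounding to the nearest grey level is an assumption.

```python
def to_grey_levels(log_probs, invert=False):
    """Map logarithmic probabilities linearly onto integers in [0, 255].

    With invert=False the smallest value maps to black (0); with
    invert=True the scale is flipped to give an anomaly image.
    """
    lo, hi = min(log_probs), max(log_probs)
    span = hi - lo
    levels = []
    for value in log_probs:
        level = 0 if span == 0 else round(255 * (value - lo) / span)
        levels.append(255 - level if invert else level)
    return levels


probability_image = to_grey_levels([-4.0, -3.0, 0.0])
anomaly_image = to_grey_levels([-4.0, -3.0, 0.0], invert=True)
```

Because the scaling is derived from the minimum and maximum of each image separately, two images processed this way do not share a common grey scale, which is the danger noted above.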


B. Texture 1 


In Figure 8 we show the first Brodatz texture image 
that we use in our experiments. The image is slightly 
unevenly illuminated and has a fairly low contrast, but 
nevertheless its statistical properties are almost transla- 
tion invariant. 

Figure 8: 

Figure 9: 256 x 256 anomaly images of Brodatz fabric number 1. 

In Figure 9 we show the anomaly images that derive 
from Figure 8. Note how the anomaly images become 
smoother as we progress from Figure 9a to Figure 
9h, due to the increasing amount of averaging that occurs 
amongst the overlapping backpropagated rectangular 
tiles that build up each image. 

Figure 9e and especially Figure 9f reveal a highly localised 
anomaly in the original image. Figure 9f corresponds 
to a length scale of 8 x 8 pixels, which is the 
approximate size of the fault that is about + of the way 
down and slightly to the left of centre of Figure 8. The 
fault does not show up clearly on the other figures in Fig- 
ure 9 because their characteristic length scales are either 
too short or too long to be sensitive to the fault. 

There is a major feature in the bottom right hand 
corner of Figure 9h, where the anomaly image is darker 
than average, indicating that the corresponding part of 
the original image has a higher than average probability. 
This is a different type of anomaly to the sort that we 
have envisaged so far — it occurs because the correspond- 
ing part of original image happens to explore only a high 
probability part of the space that is explored by the whole 
image. This part of the anomaly image is surrounded by 
a brighter than average border, which indicates a con- 
ventional anomalous region. 

From Figure 9 we conclude that ACE can easily pick 
out localised faults in highly ordered textures. 


C. Texture 2 


Figure 10: 256 x 256 image of Brodatz fabric number 2. 


In Figure 10 we show the second Brodatz texture image 
that we use in our experiments. The image has a high 
contrast and translation invariant statistical properties. 

In Figure 11 we show the anomaly images that derive 
from Figure 10. The most interesting anomaly image 
is Figure 11f which shows several localised anomalies. 
About halfway down and to the left of centre of the image 
is an anomaly that corresponds to a dark spot on the 
thread in Figure 10. The brightest of the anomalies in 
the cluster just above the centre of the image corresponds 
to what appears to be a slightly torn thread in Figure 
10. The other anomalies in this cluster are weaker, and 
correspond to slight distortions of the threads. There is 
another anomaly just below and to the right of the centre 
of Figure 11g, which corresponds to what appears to be 
another slightly torn thread in Figure 10. These anomalies 
all occur at, or around, a length scale of 8 x 8 pixels. 
Several of the anomaly images show an anomaly in the 
bottom left hand corner of the image, which corresponds 
to a small uniform patch of fabric in Figure 10. 

Figure 11: 256 x 256 anomaly images of Brodatz fabric number 2. 

The results in Figure 11 corroborate the evidence in 
Figure 9 that ACE can be trained in an unsupervised 
fashion to pick out localised faults in highly ordered tex- 
tures. 


D. Texture 3 


In Figure 12 we show the third Brodatz texture image 
that we use in our experiments. The image has a very 
high contrast and statistical properties that are almost 
translation invariant. However, the density of anomalies 
is much higher than in either Figure 8 or Figure 10. 

Figure 12: 256 x 256 image of Brodatz fabric number 3. 

In Figure 13 we show the anomaly images that derive 
from Figure 12. The most prominent anomaly is in 
Figure 13g, at a length scale of 8 x 16 pixels, which corresponds 
to a region of Figure 12 that is just above and to the 
left of centre of the image. This region is anomalous because 
it is both distorted and has slightly thicker threads 
than elsewhere. The large distorted region in the bottom 
left hand corner of Figure 12 does not show up very 
clearly to the naked eye in Figure 13, but Figure 13f and 
Figure 13h have significant peaks in this region. There 
are also many other localised peaks in Figure 13 which 
can be traced back to corresponding faults in Figure 12. 

Figure 13: 256 x 256 anomaly images of Brodatz fabric number 3. 

Comparing Figure 13 with Figure 9 and Figure 11 we 
conclude that the ability of ACE to pick out faults is degraded 
as the density of faults increases. This is because 
the faults themselves are part of the statistical properties 
that are extracted by ACE, and if a particular fault 
occurs often enough in the image then it is no longer 
deemed to be a fault. 


E. Texture 4 


In this section we present a slightly different type of 
experiment in which we train ACE on one image and 
test ACE on another image. To create the two images we 
start with a single 256 x 256 image of a Brodatz texture, 
which we divide into a left half and a right half. We then 
use the left half to build up the training image, and the 
right half to build up the test image. 


Figure 14: 256 x 256 image of Brodatz carpet for training. 


In Figure 14 we show the training image which is a 
montage of two copies of the left hand half of a Brodatz 
texture image. Note that this montage contains only as 
much information as was present in the original half im- 
age from which it was constructed. In Figure 15 we show 
the test image which is a montage of two copies of the 
right hand half of a Brodatz texture image, and super- 
imposed on that is a 64 x 64 patch which we generated 
by flipping the rows and columns of a copy of the top left 
hand corner of this image. This patch is a hand crafted 
anomaly. Note that in constructing these images we have 
scrupulously avoided the possibility that the training and 
test images could contain elements deriving from a com- 
mon source. 

In Figure 16 we show the anomaly images that derive 
from Figure 15 after having trained on Figure 14. Figure 
16f shows the strongest response to the anomalous patch 
in the centre of the image, corresponding to anomaly 
detection on a length scale of 8 x 8 pixels. 
Figure 15: 256 x 256 image of Brodatz carpet for testing. 


Figure 16: 256 x 256 anomaly images of Brodatz carpet. 


V. CONCLUSIONS 


Using maximum entropy methods, we have shown how 
to construct maximum entropy estimates of PDFs by 
using adaptive hierarchical transformation functions to 
record various marginal PDFs of the data, which we call 
an “Adaptive Cluster Expansion” (ACE). This method 
is a member of the same family as the trainable MRF 
known as the Boltzmann Machine, but it uses sophisti- 
cated transformations of the input data rather than hid- 
den variables to characterise the high order statistical 
properties of the training set. The simulations in this pa- 
per use hierarchical topographic mappings to build these 
transformations, but this is a convenience, not a neces- 
sity. 

We have also shown how to extend ACE so that it 
can be applied to translation invariant image processing, 
such as the detection of statistical anomalies in otherwise 
statistically homogeneous textures. Our methods show 
great promise, not only because they are amenable to a 
full theoretical analysis leading to closed-form maximum 
entropy solutions, but also because they lead directly to 
a modular system design which can locate anomalies in 
textures. 

We have presented several examples where ACE suc- 
cessfully detects anomalous regions in otherwise statisti- 
cally homogeneous textures. In all cases ACE adaptively 
extracts the global statistics of an image at various length 
scales during the unsupervised training, which takes 20 
seconds (on a VAXStation 3100) for the 8 layer ACE 
network that we applied to this problem. ACE then uses 
these statistics to form an output image that represents 
the probability that each local patch of the input im- 
age belongs to the ensemble of patches presented during 
training. We call this a “probability image”. 

Some possible applications of our results are as fol- 
lows. Inspection of textiles: this relies on the assumed 
statistical homogeneity of an unflawed piece of textile, so 
that faults show up as anomalies, which we have demon- 
strated successfully in this paper. Detection of targets 
in noisy background clutter in radar images: this is ba- 
sically a noisy version of the textile inspection problem, 
which goes somewhat beyond what we have presented 
in this paper, because it needs to address the problem 
of the noise entropy saturating ACE. Texture segmenta- 
tion: this is an ambitious goal which requires much fur- 
ther analysis in order to derive a computationally cheap 
method of handling multiple simultaneous textures. 


Appendix A: Vector quantisation 


In this appendix we summarise the hierarchical vec- 
tor quantisation method that we presented in detail in 
[19]. In this paper we use this technique to optimise the 
inter-layer mappings in Figure 7. We have applied this 
technique elsewhere to image compression [20], and mul- 
tilayer self-organising neural networks [17, 18, 22, 23]. 


1. Standard vector quantisation 


This subsection contains those details of the theory of
standard vector quantisation that one needs to understand
before proceeding to the modified vector quantisation
scheme that we present in Section A 2.

The problem is to form a coding y of a vector x in such
a way that a good estimate x' of x can be constructed
from knowledge of y alone. The sketch derivation in this
section is presented in greater detail in [18]. Thus a vector
quantiser is constructed by minimising a Euclidean
distortion $D_1$ with respect to the choice of coding
function y(x) and decoding function x'(y), where

$$ D_1 = \int dx\, P(x)\, \| x - x'(y(x)) \|^2 \tag{A1} $$


Figure 17: Encoding and decoding in a vector quantiser.


We may represent the encoding and decoding operations
diagrammatically as shown in Figure 17. By functionally
differentiating $D_1$ with respect to y(x) and x'(y)
we obtain

$$ \frac{\delta D_1}{\delta y(x)} = P(x) \left. \frac{\partial \| x'(y) - x \|^2}{\partial y} \right|_{y = y(x)} \tag{A2} $$

$$ \frac{\delta D_1}{\delta x'(y)} = 2 \int dx\, P(x)\, \delta(y - y(x))\, (x'(y) - x) \tag{A3} $$


Setting $\frac{\delta D_1}{\delta y(x)} = 0$ in Equation A2 yields the optimum
encoding function

$$ y(x) = \arg\min_{y} \| x - x'(y) \| \tag{A4} $$

which is called “nearest neighbour encoding”. Setting
$\frac{\delta D_1}{\delta x'(y)} = 0$ in Equation A3 yields the optimum decoding
function

$$ x'(y) = \frac{\int dx\, P(x)\, \delta(y - y(x))\, x}{\int dx\, P(x)\, \delta(y - y(x))} \tag{A5} $$


which is the update scheme derived in [10]. Alternatively, 
we may use an incremental scheme to optimise the de- 
coding function by following the path of steepest descent 
which we may obtain from Equation A3 as 


$$ \delta x'(y) = \epsilon\, \delta(y - y(x))\, (x - x'(y)) \quad \text{where } 0 < \epsilon < 1. \tag{A6} $$

An iterative optimisation scheme may be formed by al- 
ternately applying Equation A4 and then either Equation 
A5 or Equation A6. This scheme will alternately improve 
the encoding and decoding functions until a local mini- 
mum distortion is located. Alternating Equation A4 and 
Equation A5 is commonly called the “LBG” (after the 
authors of [10]) or “k-means” algorithm. 
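
The alternation of Equation A4 and Equation A5 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the paper: the function names, the two-cluster test data, and the choice to leave unused code vectors where they are all assumptions made for the example.

```python
import numpy as np

def nearest_neighbour_encode(x, codebook):
    """Equation A4: index of the code vector closest to each row of x."""
    # Squared Euclidean distance from every sample to every code vector.
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def centroid_decode(x, codes, codebook):
    """Equation A5: each code vector becomes the mean of the samples assigned to it."""
    new_codebook = codebook.copy()
    for y in range(len(codebook)):
        members = x[codes == y]
        if len(members) > 0:  # assumption: unused code vectors are left unchanged
            new_codebook[y] = members.mean(axis=0)
    return new_codebook

def lbg(x, codebook, n_iter=20):
    """Alternate Equations A4 and A5; return the codebook and the distortion history."""
    history = []
    for _ in range(n_iter):
        codes = nearest_neighbour_encode(x, codebook)
        codebook = centroid_decode(x, codes, codebook)
        codes = nearest_neighbour_encode(x, codebook)
        # Sample estimate of the distortion in Equation A1.
        history.append(((x - codebook[codes]) ** 2).sum(axis=1).mean())
    return codebook, history

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, (200, 2)), rng.normal(2.0, 0.5, (200, 2))])
initial = x[rng.choice(len(x), 2, replace=False)].copy()
codebook, history = lbg(x, initial)
```

Each pass can only lower (or leave unchanged) the sample distortion, which is why the scheme settles into a local minimum rather than necessarily the global one.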


2. Noisy vector quantisation


This subsection contains the theoretical details of the 
optimisation of inter-layer mappings that we use in our 
numerical simulations in Section IV. Thus we generalise 
the results of Section A 1 to the case where the encoded 
version of the input vector is distorted by a noise process 
[4, 9, 19, 23]. 

Define a modified Euclidean distortion $D_2$ as

$$ D_2 = \int dx\, P(x) \int dn\, \pi(n)\, \| x - x'(y(x) + n) \|^2 \tag{A7} $$

where $\pi(n)$ is the PDF of the noise n that distorts the code y(x).

We may represent the encoding and decoding operations
together with the noise process diagrammatically
as shown in Figure 18, which is a trivially modified version
of Figure 17. By functionally differentiating $D_2$ with
respect to y(x) and x'(y) we obtain


$$ \frac{\delta D_2}{\delta y(x)} = P(x) \int dn\, \pi(n) \left. \frac{\partial \| x'(y) - x \|^2}{\partial y} \right|_{y = y(x) + n} \tag{A8} $$

$$ \frac{\delta D_2}{\delta x'(y)} = 2 \int dx\, P(x)\, \pi(y - y(x))\, (x'(y) - x) \tag{A9} $$

Equation A8 is a “smeared” version of Equation A2, so
$\frac{\delta D_2}{\delta y(x)} = 0$ does not lead to nearest neighbour encoding, because
the distances to other code vectors have to be taken
into account in order to minimise the damaging effect of
the noise process. However, it is usually a good approximation
to use the nearest neighbour encoding scheme
shown in Equation A4. Setting $\frac{\delta D_2}{\delta x'(y)} = 0$ in Equation
A9 yields the optimum decoding function

$$ x'(y) = \frac{\int dx\, P(x)\, \pi(y - y(x))\, x}{\int dx\, P(x)\, \pi(y - y(x))} \tag{A10} $$

which should be compared with Equation A5. Alternatively,
we may obtain a steepest descent scheme in the
form

$$ \delta x'(y) = \epsilon\, \pi(y - y(x))\, (x - x'(y)) \quad \text{where } 0 < \epsilon < 1 \tag{A11} $$

which should be compared with Equation A6.

As in Section A 1, iterative optimisation schemes can
be constructed in which we alternate the optimisation of
the coding and decoding functions. Alternating Equation
A4 (which approximately solves $\frac{\delta D_2}{\delta y(x)} = 0$) and Equation
A11 yields the standard topographic mapping training
algorithm [8], which is widely used in various forms in
neural network simulations.


3. Hierarchical vector quantisation 


In Figure 19 we show the simplest type of hierarchical 
vector quantiser. It consists of an inner quantiser con- 
tained in the dashed box, surrounded by a pair of outer 
quantisers. 




If the part of the diagram contained in the dashed
box were removed and direct connections made so that
$y_1' = y_1$ and $y_2' = y_2$, then Figure 19 would reduce to a
pair of independent vector quantisers of the type shown
in Figure 17. The dashed box contains a vector quantiser
which encodes $(y_1, y_2)$ to produce a code which it
subsequently decodes to obtain $(y_1', y_2')$.

From the point of view of $y_1$ the effect of being passed
through the inner quantiser is to modify $y_1$ thus: $y_1 \to y_1'$.
A similar argument applies to $y_2 \to y_2'$. The actual
distortions $y_1' - y_1$ and $y_2' - y_2$ will be correlated in practice,
but we shall model them as if they were independent
processes, and thus reduce Figure 19 to two independent
vector quantisers of the type shown in Figure 18.

This procedure can be extended to a hierarchical vec- 
tor quantiser with any number of levels of nesting. From 
the point of view of the quantisers at any level, we shall 
model the effect of the quantisers inwards from that level 
as independent distortion processes. It turns out not to 
be critically important what precise distortion model one 
uses, provided that it approximately represents the over- 
all scale of the distortion due to quantisation. 

In [19] we presented in detail a phenomenological dis- 
tortion model that we used to obtain an efficient train- 
ing procedure for topographic mappings and their appli- 
cation to hierarchical vector quantisers. Alternatively, 
the standard topographic mapping training procedure in 
[8] could be used, but this is a rather inefficient algo- 
rithm. The basic training procedure may be obtained 
from Equation A11 as 


1. Select a training vector x at random from the train- 
ing set. 


2. Encode x to produce y (= y(x)).


3. For all y' do the following:


(a) Determine the corresponding code vector
x'(y').

(b) Move the code vector x'(y') directly towards
the input vector x by a distance
$\epsilon\, \pi(y' - y(x))\, \| x - x'(y') \|$.

4. Go to step 1.


This cycle is repeated as often as is required to ensure 
convergence of the codebook of code vectors. 
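
The training cycle above can be sketched as follows. This is an illustrative sketch only: the Gaussian form of the neighbourhood function, the width schedule, the step size, and the one-dimensional test data are assumptions chosen for the example, since the text requires only an even unimodal function.

```python
import numpy as np

def neighbourhood(dy, width):
    """Assumed even, unimodal pi(y' - y): a Gaussian of the given width."""
    return np.exp(-0.5 * (dy / width) ** 2)

def train_step(x, codebook, eps, width):
    """Steps 2 and 3 of the cycle for a single training vector x (updates codebook in place)."""
    # Step 2: nearest neighbour encoding (Equation A4).
    y = int(((codebook - x) ** 2).sum(axis=1).argmin())
    # Step 3: move every code vector towards x, weighted by pi(y' - y) (Equation A11).
    y_prime = np.arange(len(codebook))
    weight = eps * neighbourhood(y_prime - y, width)
    codebook += weight[:, None] * (x - codebook)
    return y

rng = np.random.default_rng(1)
data = rng.uniform(0.0, 1.0, size=(2000, 1))   # 1-D inputs on [0, 1]
codebook = rng.uniform(0.4, 0.6, size=(8, 1))  # 8 code vectors on a 1-D code lattice
n_steps = 4000
for t in range(n_steps):
    # Assumed schedule: shrink the neighbourhood width from 3.0 down to 0.3.
    width = 3.0 * (0.3 / 3.0) ** (t / n_steps)
    # Step 1: pick a training vector at random; step 4 is the loop itself.
    train_step(data[rng.integers(len(data))], codebook, eps=0.1, width=width)
```

Because every update is a convex combination of a code vector and a training vector, the code vectors never leave the range spanned by the data and their starting positions.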

The standard method [8] specifies that $\pi(y' - y)$ should
be an even unimodal function whose width should be 
gradually decreased as training progresses. This allows 
coarse-grained organisation of the codebook to occur, fol- 
lowed progressively by ever more fine-grained organisa- 
tion, until finally the algorithm converges towards an op- 
timum codebook. 

Figure 18: Encoding and decoding in a noisy vector quantiser.


Figure 19: Encoding and decoding in a hierarchical vector quantiser.


In our own modification [19] of the standard method
we replace a shrinking $\pi(y' - y)$ function acting on a
fixed number of code vectors by a fixed $\pi(y' - y)$ function
acting on an increasing number of code vectors. There
are many minor variations on this theme, but we find
that it is sufficient to define

$$ \pi(y' - y) = \begin{cases} \epsilon & y' = y \\ \epsilon' & |y' - y| = 1 \\ 0 & |y' - y| > 1 \end{cases} \tag{A12} $$


where we have absorbed $\epsilon$ in Equation A11 into the definition
of $\pi(y' - y)$. We use a binary sequence of codebook
sizes N = 2, 4, 8, 16, 32, ..., where each codebook
is initialised by interpolation from the next smaller codebook.
We find that the following parameter values yield
adequate convergence: $\epsilon = 0.1$, $\epsilon' = 0.05$, and we perform
20N training updates before doubling the value of
N and progressing to the next larger size of codebook.
The N = 2 codebook can be initialised using a random
pair of vectors from the training set.
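
The codebook-doubling schedule might be sketched as below. The neighbourhood values and the 20N update count are taken from the text; the interpolation rule (midpoints between neighbouring code vectors, with a linear extrapolation at the end of the lattice), the function names, and the test data are assumptions, because the text says only that each codebook is initialised by interpolation.

```python
import numpy as np

EPS, EPS_PRIME = 0.1, 0.05  # parameter values quoted in the text

def pi_a12(dy):
    """Equation A12: eps at the winning code, eps' at its two lattice neighbours, else 0."""
    dy = np.abs(dy)
    return np.where(dy == 0, EPS, np.where(dy == 1, EPS_PRIME, 0.0))

def double_codebook(codebook):
    """Assumed interpolation rule: keep the old vectors and insert midpoints between them."""
    n, dim = codebook.shape
    bigger = np.empty((2 * n, dim))
    bigger[0::2] = codebook
    bigger[1:-1:2] = 0.5 * (codebook[:-1] + codebook[1:])
    # The last slot has no right-hand neighbour, so extrapolate linearly.
    bigger[-1] = codebook[-1] + 0.5 * (codebook[-1] - codebook[-2])
    return bigger

def train(data, n_final, rng):
    # N = 2 codebook: a random pair of vectors from the training set.
    codebook = data[rng.choice(len(data), 2, replace=False)].astype(float)
    while True:
        n = len(codebook)
        for _ in range(20 * n):  # 20N updates per codebook size
            x = data[rng.integers(len(data))]
            y = int(((codebook - x) ** 2).sum(axis=1).argmin())
            w = pi_a12(np.arange(n) - y)
            codebook += w[:, None] * (x - codebook)
        if n >= n_final:
            return codebook
        codebook = double_codebook(codebook)

rng = np.random.default_rng(2)
data = rng.uniform(0.0, 1.0, size=(5000, 1))
codebook = train(data, n_final=16, rng=rng)
```

Keeping the neighbourhood fixed at one lattice step while the lattice itself grows plays the same role as shrinking a wide neighbourhood on a fixed lattice: early updates organise the codebook coarsely, later ones refine it.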


Appendix B: Relationship to WISARD 


The second bracketed term in Equation 2.24 could be 
implemented in hardware as shown in Figure 20. This 
implementation assumes that the state vector x is quan- 
tised by representing each of its components using a fi- 
nite number of binary digits (bits). Note that we have 
taken advantage of the fact that we are discussing a single 
layer network in order to simplify the notation in Figure 
20 (as compared with Equation 2.24). The i-th block
in this circuit is a random access memory (RAM) which
records a transformation from $x_i$ (the address of an entry
in the RAM) to $\log P_i(x_i)$ (the corresponding entry
in the RAM). The data bus at the bottom of Figure 20
carries the components of x (represented bitwise) to the
relevant RAM. Note that each bit of x is used exactly
once in forming addresses for the RAM, so the mapping
from x to the set of addresses is bijective. The upper part
of Figure 20 shows how the outputs are directed to an accumulator
where they are summed to form $\log Q_{\mathrm{mem}}(x)$.


Figure 20 is a variant of the WISARD pattern recog- 
nition network [2]. The elements that our MEM solution 
and WISARD have in common are: a bijective mapping 
from the bits of an input state vector onto the address 
lines of a set of RAMs, and the accumulation of the out- 
puts of the RAMs to form the overall network output. 


However, there are some differences between the sin- 
gle layer ACE and the WISARD prescriptions for the 
contents of the RAMs. ACE specifies a set of functions 
(i.e. logarithms of marginal probabilities) to tabulate in 
the RAMs. Suppose that we truncate these table entries 
to a 1-bit representation, so that we use 0 to represent 
small logarithmic probabilities and 1 to represent large 
logarithmic probabilities. Each entry (i.e. 1 or 0) in the 
table then records whether the configuration of binary 
digits (i.e. the address of the entry) frequently occurs in 
the set of patterns corresponding to P(x) (the “training 
set”). The final output is therefore the total number of 
1’s that the input pattern addresses in the n tables. In 
effect, this is the total number of coincidences between 
configurations of bits in the input pattern and those in a 
predefined category. This 1-bit version of ACE is qual- 
itatively the same as the table look-up and summation 
operations performed in the simplest WISARD network, 
which completes the connection that we sought between 
ACE and WISARD. 
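
The 1-bit table look-up and summation described above can be illustrated with a toy example. This is a sketch under stated assumptions: the bits are grouped contiguously into addresses, the truncation to 1 bit is implemented as a count threshold, and all function names and the tiny training set are invented for the illustration.

```python
import numpy as np

def addresses(x_bits, n_tables):
    """Split the bit vector into n_tables equal groups; each group is one RAM address."""
    groups = x_bits.reshape(n_tables, -1)
    powers = 1 << np.arange(groups.shape[1])[::-1]
    return groups @ powers

def train_tables(patterns, n_tables, threshold):
    """Histogram each address, then truncate to 1 bit (assumed rule: count >= threshold)."""
    bits_per_table = patterns.shape[1] // n_tables
    counts = np.zeros((n_tables, 2 ** bits_per_table), dtype=int)
    for p in patterns:
        counts[np.arange(n_tables), addresses(p, n_tables)] += 1
    return (counts >= threshold).astype(int)

def response(x_bits, tables):
    """Accumulate the RAM outputs: the number of tables whose addressed entry is 1."""
    n_tables = len(tables)
    return int(tables[np.arange(n_tables), addresses(x_bits, n_tables)].sum())

training_set = np.array([[1, 1, 0, 0, 1, 0, 1, 0],
                         [1, 1, 0, 0, 1, 0, 1, 1]])
tables = train_tables(training_set, n_tables=4, threshold=1)
score_seen = response(training_set[0], tables)
score_unseen = response(np.array([0, 1, 1, 1, 0, 1, 0, 0]), tables)
```

A pattern from the training set addresses a 1 in every table, whereas a pattern that shares no bit configuration with the training set scores zero. Replacing the 1-bit entries by logarithms of marginal probabilities recovers the single layer ACE sum.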


Figure 20: Single layer ACE is WISARD.


Adaptive Cluster Expansion (ACE): A Multilayer Network for Estimating Probability Density Functions



S P Luttrell 
Defence Research Agency, Malvern 


We derive an adaptive hierarchical method of estimating high dimensional probability density 
functions. We call this method of density estimation the “adaptive cluster expansion”, or ACE for 
short. We present an application of this approach, based on a multilayer topographic mapping 
network, that adaptively estimates the joint probability density function of the pixel values of an 
image, and presents this result as a “probability image”. We apply this to the problem of identifying 
statistically anomalous regions in otherwise statistically homogeneous images. 


I. INTRODUCTION 


The purpose of this paper is to develop a novel type 
of adaptive network for estimating probability density 
functions (PDF) for use in Bayesian analysis [3, 5]. We 
consider only techniques that scale well for use in high 
dimensional spaces, such as the analysis of large arrays
of pixels in image processing. There have been many attempts
to solve this type of density estimation problem.
For instance, the Boltzmann machine [1] is essentially 
a trainable Gibbs distribution, which permits arbitrar- 
ily complicated statistical structure to be modelled via 
hidden variables. Unfortunately, this generality must be 
paid for by performing lengthy Monte Carlo simulations. 
There are various extensions to this technique, such as 
the higher order Boltzmann machine [13], which capture 
higher order statistical behaviour more economically, but 
none of these variations has been shown to be suitable 
for high-dimensional image processing problems. Using 
maximum entropy techniques [4], we develop a number of 
variations on the Gibbs distribution approach [10], and 
propose a scheme in which we replace simple interactions 
between a large number of hidden variables (as in the 
Boltzmann machine) by complicated interactions which 
directly model the statistical structure of the data; this 
is an extreme form of the approach taken in [13]. 

The novel adaptive density estimator that we develop 
in [10] is based on a multilayer network, in which we 
choose the layer-to-layer connections to be hierarchical, 
and the layer-to-layer transformations to be topographic 
mappings [6]; this adaptively transforms the input data 
into a multiscale “pyramid-like” format. In [10] we further 
propose that the joint PDF’s of adjacent nodes in each 
layer should be combined to form an estimate of the joint 
PDF of the nodes in the input layer. By analogy with the 
standard derivation of Gibbs distributions, we can also 
derive our joint PDF estimate by applying the maximum 
entropy method [4]. However, our result is computationally
much cheaper to implement than a standard Gibbs
distribution, because we do not need to perform Monte 
Carlo simulations in order to integrate over the states of 
hidden variables. We suggest the name “adaptive cluster 
expansion” (ACE) for this type of network estimate of 
high-dimensional joint PDF’s. Other literature on this 
approach can be found in [7-9, 11, 12], where we further
develop multilayer topographic mapping networks, and 
their relationship to vector quantisers. 

The purpose of this paper is to present a complete ac- 
count of ACE, and to demonstrate its effectiveness when 
applied to the problem of density estimation. We do not 
dwell on the details of how to implement the topographic 
mapping training algorithm (we review this in the appendix).
In Section II we develop the ACE method of
density estimation by appealing to simple counting arguments.
In Section III we demonstrate the power of ACE
by applying it to the problem of estimating the joint PDF 
of the pixels of textured images selected from the Brodatz 
album [2]. 


II. PROBABILITY DENSITY FUNCTION 
ESTIMATION 


In this section we present a derivation of the ACE estimate
Q(x) of a PDF P(x). We develop this result by
appealing to simple counting arguments and by using a
diagrammatic language.


A. Derivation of the ACE Estimate of a PDF 


We show a simple network in Figure 1, where the
input space is 4-dimensional, the output space is 2-dimensional,
and we factorise the feedforward transformation
as $y(x) = (y_1(x_1, x_2), y_2(x_3, x_4))$. Suppose we estimate
the joint PDF $P_{\mathrm{out}}(y_1, y_2)$ of the outputs, and the
joint PDFs $P_{\mathrm{in},12}(x_1, x_2)$ and $P_{\mathrm{in},34}(x_3, x_4)$ of each pair
of inputs, by measuring their histograms, for instance.
Using this information alone we now wish to construct
an estimate Q(x) of the true joint PDF P(x) of the 4-dimensional
input. There are two alternative, but equivalent,
ways of writing Q(x), each of which has its own
interesting interpretation.


Figure 1: Basic 2-layer network. The input space (layer 0) is
4-dimensional and the output space (layer 1) is 2-dimensional.
The layer 0-to-1 transformation factorises into two independent
transformations: $y_1$ depends only on $(x_1, x_2)$, and $y_2$
depends only on $(x_3, x_4)$.


Firstly, we may write

$$ Q(x) = P_{\mathrm{out}}(y_1(x_1, x_2), y_2(x_3, x_4))\, \frac{P_{\mathrm{in},12}(x_1, x_2)}{P_{\mathrm{out}}(y_1(x_1, x_2))}\, \frac{P_{\mathrm{in},34}(x_3, x_4)}{P_{\mathrm{out}}(y_2(x_3, x_4))} \tag{2.1} $$

In Equation 2.1 we construct Q(x) as follows. We use
$P_{\mathrm{out}}(y_1, y_2)$ directly to estimate the joint PDF of the
outputs, and indirectly to estimate the joint PDF of
the inputs. In order to convert a PDF in output space
(i.e. $P_{\mathrm{out}}(y_1, y_2)$) into a PDF in input space we must
divide $P_{\mathrm{out}}(y_1, y_2)$ by a compression factor equal to the
number of input values that can produce the observed
output value. Because we obtain $y_1$ and $y_2$ separately
from the pairs $(x_1, x_2)$ and $(x_3, x_4)$, respectively, this
compression factor is the product of two separate factors.
For instance, the compression factor corresponding
to $y_1$ is the ratio $\frac{P_{\mathrm{out}}(y_1(x_1, x_2))}{\langle P_{\mathrm{in},12}(x_1, x_2) \rangle}$, where $\langle P_{\mathrm{in},12}(x_1, x_2) \rangle$ is
the average value of $P_{\mathrm{in},12}(x_1, x_2)$ over all the $(x_1, x_2)$
that produce the same value of $y_1$. However, we may
refine this compression factor by using $P_{\mathrm{in},12}(x_1, x_2)$ instead
of $\langle P_{\mathrm{in},12}(x_1, x_2) \rangle$ in the denominator, to yield the
ratio $\frac{P_{\mathrm{out}}(y_1(x_1, x_2))}{P_{\mathrm{in},12}(x_1, x_2)}$. An analogous argument may be applied
to obtain the compression factor corresponding to
$y_2$, and the results combined to obtain the final expression
for Q(x) as shown in Equation 2.1.

Secondly, we may write

$$ Q(x) = P_{\mathrm{in},12}(x_1, x_2)\, P_{\mathrm{in},34}(x_3, x_4)\, \frac{P_{\mathrm{out}}(y_1(x_1, x_2), y_2(x_3, x_4))}{P_{\mathrm{out}}(y_1(x_1, x_2))\, P_{\mathrm{out}}(y_2(x_3, x_4))} \tag{2.2} $$

which is trivially the same as Equation 2.1, but we
have arranged its terms in a new way. This furnishes
us with an alternative interpretation of Q(x). Thus,
imagine that we are provided only with $P_{\mathrm{in},12}(x_1, x_2)$
and $P_{\mathrm{in},34}(x_3, x_4)$, and no information about the correlations
between the pair $(x_1, x_2)$ and the pair $(x_3, x_4)$.
This is sufficient for us to construct Q(x) as the product
$P_{\mathrm{in},12}(x_1, x_2)\, P_{\mathrm{in},34}(x_3, x_4)$. Now, we admit that in fact
we also know $P_{\mathrm{out}}(y_1, y_2)$, which is a source of information
about correlations between the pair $(x_1, x_2)$ and the
pair $(x_3, x_4)$. We make use of this information by forming
the dimensionless ratio $\frac{P_{\mathrm{out}}(y_1, y_2)}{P_{\mathrm{out}}(y_1)\, P_{\mathrm{out}}(y_2)}$, which differs
from unity when $y_1$ and $y_2$ are correlated random variables
(i.e. $P_{\mathrm{out}}(y_1, y_2) \neq P_{\mathrm{out}}(y_1)\, P_{\mathrm{out}}(y_2)$). This ratio
is greater (or less) than unity when the pair $(y_1, y_2)$ is
more (or less) likely to occur than would have been estimated
from knowledge of the marginal PDFs $P_{\mathrm{out}}(y_1)$
and $P_{\mathrm{out}}(y_2)$ alone. Finally, we use this dimensionless
ratio as a correction factor to obtain the expression for
Q(x) shown in Equation 2.2. This derivation is heuristic,
but it leads to the same result as shown in Equation 2.1.

In Figure 2 we present an alternative representation
of the network in Figure 1, in which we emphasise the
PDFs that we use to construct Q(x). Thus we introduce
a shorthand notation in which we use an oval to highlight
each clique of nodes in the network. We define the word
“clique” to mean “complete set of nodes having the same
parent node”. As is conventional when discussing tree-like
networks, we regard the higher layers of the network
as being the ancestors of the lower layers, regardless of
the fact that the direction of information flow is in the
opposite direction through the tree. We then construct
Q(x) as the product of the three clique PDFs shown,
whilst ensuring that the clique in layer 1 is appropriately
normalised to render its contribution dimensionless. This
leads to the form of Q(x) in Equation 2.2.


Figure 2: The clique PDFs that we use in the basic 2-layer
network. $P_{\mathrm{out}}(y_1, y_2)$ is the joint PDF of the pair of network
outputs, and $P_{\mathrm{out}}(y_1)$ and $P_{\mathrm{out}}(y_2)$ are its two marginal PDFs.
$\frac{P_{\mathrm{out}}(y_1, y_2)}{P_{\mathrm{out}}(y_1)\, P_{\mathrm{out}}(y_2)}$ is a dimensionless ratio which records correlations
between $y_1$ and $y_2$. $P_{\mathrm{in},12}(x_1, x_2)$ and $P_{\mathrm{in},34}(x_3, x_4)$ are
the joint PDFs of the pairs of inputs from which $y_1$ and $y_2$
derive, respectively.


This diagrammatic approach to constructing Q(x)
may be readily extended to any tree-like feedforward net- 
work. We favour this approach, because the basic strat- 
egy for deriving Q(x) by invoking compression factors 
remains the same, but the burden of notational detail 
becomes somewhat heavy, so diagrams provide an ideal 
shortcut. For convenience, we summarise the prescrip- 
tion for constructing Q(x) from a tree-like diagram as 
follows: 


1. Estimate all of the clique PDFs, as histograms, for 
instance. 


2. Deduce all of the single-node marginal PDFs from
the clique PDFs estimated in the previous step. For
instance this would create $P_{\mathrm{out}}(y_1)$ and $P_{\mathrm{out}}(y_2)$
from $P_{\mathrm{out}}(y_1, y_2)$. This step is not needed in
layer 0.


3. From the results estimated in the previous two
steps, for each clique compute a clique factor as
follows:


(a) In the input layer the factor is the clique PDF
itself.
(b) In other layers the factor is the clique PDF
divided by the product of its marginal PDFs
(e.g. $\frac{P_{\mathrm{out}}(y_1, y_2)}{P_{\mathrm{out}}(y_1)\, P_{\mathrm{out}}(y_2)}$).


4. Finally, to construct Q(x), form the product of all
of the clique factors estimated in the previous step.
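
The four-step prescription can be checked numerically on the 4-dimensional example of Figure 1. This is a minimal sketch, not the paper's code: the random true PDF, the arbitrary fixed mappings `y1_of` and `y2_of`, and the alphabet sizes are assumptions made purely for the illustration, and the clique PDFs are computed exactly from the true PDF rather than from sample histograms.

```python
import numpy as np

rng = np.random.default_rng(3)
V, C = 4, 2  # assumed sizes: values per input component, codes per output node

# A random true joint PDF P(x1, x2, x3, x4) and two arbitrary layer 0-to-1 mappings.
P = rng.random((V, V, V, V))
P /= P.sum()
y1_of = rng.integers(C, size=(V, V))  # y1(x1, x2)
y2_of = rng.integers(C, size=(V, V))  # y2(x3, x4)

# Step 1: clique PDFs.
P12 = P.sum(axis=(2, 3))              # P_in,12(x1, x2)
P34 = P.sum(axis=(0, 1))              # P_in,34(x3, x4)
Pout = np.zeros((C, C))               # P_out(y1, y2)
for x1, x2, x3, x4 in np.ndindex(V, V, V, V):
    Pout[y1_of[x1, x2], y2_of[x3, x4]] += P[x1, x2, x3, x4]

# Step 2: single-node marginals of the layer 1 clique.
Pout1, Pout2 = Pout.sum(axis=1), Pout.sum(axis=0)

# Steps 3 and 4: product of clique factors, written both ways.
Q21 = np.zeros_like(P)  # compression-factor form (Equation 2.1)
Q22 = np.zeros_like(P)  # clique-factor form (Equation 2.2)
for x1, x2, x3, x4 in np.ndindex(V, V, V, V):
    a, b = y1_of[x1, x2], y2_of[x3, x4]
    Q21[x1, x2, x3, x4] = Pout[a, b] * (P12[x1, x2] / Pout1[a]) * (P34[x3, x4] / Pout2[b])
    Q22[x1, x2, x3, x4] = P12[x1, x2] * P34[x3, x4] * Pout[a, b] / (Pout1[a] * Pout2[b])
```

The two forms agree term by term, and the estimate sums to unity over the input space because summing $P_{\mathrm{in},12}$ over the inputs that map to a given $y_1$ reproduces $P_{\mathrm{out}}(y_1)$, and likewise for $y_2$.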


B. Translation Invariance 


A disadvantage of the above prescription for construct- 
ing Q(x) is that it does not treat the components of x 
on an equal footing. For instance, in Equation 2.2 we see
that the pair $(x_1, x_2)$ is treated differently from the pair
$(x_2, x_3)$, even though both of these are pairs of adjacent
components in the data. In order to solve this problem we 
construct a number of different tree-like networks, each of 
which breaks symmetry in its own peculiar way, and then 
we combine the results from each network to construct a 
composite Q(x) which respects the required symmetry. 


Figure 3: An example of the 4 separate 2-layer networks that
we need to combine in order to produce a Q(x) that treats
each component of x on an equal footing. Figure 3a shows the
basic 2-layer network, Figure 3b shows the same network with
the layer 1 clique PDFs translated. Figure 3c and Figure
3d derive from Figure 3a and Figure 3b by simultaneously
translating the clique PDFs in both network layers.


In Figure 3 we show an example of the set of 4 different
2-layer networks which we need to combine in order
to construct a composite Q(x). In this example we assume
that the input is a high dimensional vector, so we
can ignore edge effects. We replicate the basic network
structure of Figure 2 across the input vector, as shown.
Each of the 4 networks has its own set of clique PDFs
(drawn as ovals in Figure 2), each of which leads to its
own estimate Q(x) which breaks symmetry. However, a
symmetric combination (such as the arithmetic or geometric
mean) of these 4 results treats each component
of x on an equal footing. We can verify this by noting
that the set of cliques that contributes in the spatial
neighbourhood of each component of x does not depend
(apart from a trivial overall translation) on which component
we select.

We must select a prescription for forming the composite
Q(x). It needs only to be a symmetric combination of
the 4 individual estimates that we show in Figure 3; the
arithmetic mean and geometric mean are obvious choices.
On pragmatic grounds, we choose to use the geometric
mean, because it corresponds to the arithmetic mean of
log Q(x), which is more convenient to perform in limited
precision hardware (log Q(x) has a much smaller dynamic
range than Q(x), assuming that we avoid the logarithmic
singularity).


Figure 4: Example of the composite network connectivity that
we require in order for a single network to compute a composite
Q(x), which treats each component of the input on
an equal footing. This connectivity is the union of all of the
binary trees that can be generated from a reference binary tree
(which we highlight in bold).


In Figure 4 we show the connectivity of part of a 4- 
layer composite network that can be used to process the 
input data in preparation for constructing a composite 
Q(x). This connectivity contains all possible embedded 
tree-like networks, and in Figure 4 we highlight one such 
embedded tree for illustrative purposes. 

For an n-layer network, we form the composite Q(x)
as the geometric mean over the Q(x) derived from all
tree-like networks that are embedded in this composite
network, to yield the geometric mean PDF $Q_{\mathrm{gm}}(x)$ in
the form

$$ \log Q_{\mathrm{gm}}(x) = \sum_{L=0}^{n-1} \frac{1}{2^{L+1}} \sum_{k} \log P_k^L \tag{2.3} $$

where L sums over layers 0 to n - 1 of the network, k
sums over cliques within a layer of the network, and $P_k^L$
is the clique PDF at position k in layer L. It is important
to note that the cliques are not simply adjacent
nodes in each layer of the network. We must select pairs
of nodes that form a “complete set of nodes having the
same parent node”. In layer 0 this means that the nodes
are adjacent. In layer 1 the nodes in a pair are separated
by 1 intervening node. In layer 2 there are 3 intervening
nodes, and so on. For $L \geq 1$ we must ensure that the $P_k^L$
are dimensionless by dividing out the marginal PDFs, as
in Equation 2.2. The $\frac{1}{2^{L+1}}$ factor ensures that we include
each tree-like network exactly once, and that the final result
is indeed the geometric mean of these contributions.
Figure 3 shows the terms that Equation 2.3 generates
when we set n = 2.
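
As a quick consistency check, Equation 2.3 can be written out for n = 2, reading the layer weight as $1/2^{L+1}$ (the explicit expansion below is our own illustration rather than an equation from the original text):

$$ \log Q_{\mathrm{gm}}(x) = \frac{1}{2} \sum_{k} \log P_k^0 + \frac{1}{4} \sum_{k} \log P_k^1 $$

These weights agree with a direct count over the 4 networks in Figure 3. Figures 3a and 3b share their layer 0 cliques, as do Figures 3c and 3d, so each layer 0 clique appears in 2 of the 4 networks and receives weight 2/4 = 1/2. Each layer 1 clique appears in only 1 of the 4 networks and receives weight 1/4.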

There are two further assumptions that we could make 
in order to simplify our result even further. Firstly, we 
could assume that the layer-to-layer transformations in 
Figure 4 were independent of position k within each layer
L. Secondly, we could assume that the clique PDFs were
independent of position k within each layer L. We can
make both of these assumptions if the statistical properties
of the input data are known to be translationally
invariant (such as might be the case for an image of a texture,
for instance). In all of our numerical simulations we
make these two simplifying assumptions.


C. Modular Implementation 


We now describe a practical implementation of Equation
2.3 in the context of image processing (i.e. 2-dimensional
arrays of pixels of data). There are three basic operations
to perform. We must use a training set to determine
suitable layer-to-layer transformations, then estimate the
clique PDFs in each network layer, and then construct
$\log Q_{\mathrm{gm}}(x)$ from these estimates. Ideally we should optimise
the layer-to-layer transformations directly so that
the constructed $Q_{\mathrm{gm}}(x)$ is “close to” P(x) in some sense
(e.g. relative entropy), but we have not yet found a computationally
cheap way of doing this. Instead, we tackle
the problem indirectly, by using our existing multilayer
topographic mapping network technique [8]. There are
two main reasons for this choice. Firstly, this type of
network is computationally cheap to train; we typically
train such a network at the rate of 2.3 seconds per layer
on a VAXstation 3100 workstation (assuming 6 bit data
values). Secondly, the network encodes the input in such
a way as to be able to reconstruct it approximately from
the state of any network layer. Although this second
property does not in general imply that the encoded input
is the optimal one for constructing an estimate of
the input PDF, it turns out that it does produce useful
results.

In Figure 5 we show a system for constructing $Q_{\mathrm{gm}}(x)$,
which consists of two interconnected subsystems: a multilayer
topographic mapping subsystem for transforming
the input image, and a PDF estimation subsystem for
forming an output image which contains the contributions
to $Q_{\mathrm{gm}}(x)$, each recorded in its spatially correct
location in the image. For obvious reasons, we call the
output image a “probability image”. The flow from left
to right across the top half of Figure 5 implements the
network structure in Figure 4, and the flow from right
to left across the bottom half of Figure 5 progressively
constructs the probability image.

In Figure 5 the input image becomes layer 0 of a multilayer
network. In layer 0 we extract a pair of adjacent
pixels, and then pass it through a look-up table (or
mapping) to yield a single value which we write into the
appropriate pixel location in layer 1 (in Figure 1 this
corresponds to transforming $(x_1, x_2)$ to become $y_1$). We
repeat this operation all over layer 0, to yield a whole
array of transformed values in layer 1. There is an arbitrariness
in our choice of the relative position of the pairs
of pixels that we use (e.g. north-south, or east-west, etc).
In our simulations we use a north-south relative position
in the layer 0-to-1 transformation, east-west in the layer
1-to-2 transformation, and alternate these two choices
thereafter as we progress from layer to layer of the network.
Note also that the separation of the pairs of pixels
is not the same in each layer. In Figure 4 the separation
doubles as we progress from layer to layer, but in Figure
5 the separation doubles after every two layers, because
we must allow both the east-west and the north-south
orientations to be processed at all separations (this is
a consequence of processing 2-dimensional data through
a 1-dimensional tree-structured network). If we concentrate
only on the topology of the network that results
from this prescription in Figure 5, we discover that it is
identical to the topology in Figure 4. Thus, the only difference
between these two cases is the way in which we
identify the pixels of the input data array with the layer
0 nodes.


Figure 5: First two layers of a modular system for constructing $Q_{\mathrm{gm}}(x)$. The top half of the diagram is a multilayer network
subsystem (actually, a multilayer topographic mapping in this case), which operates from left to right. The bottom half of
the diagram is the PDF estimation subsystem, which operates from right to left. We connect the two systems by feeding
logarithmic clique PDFs measured in the multilayer network through to be added together in the PDF estimator.


We may use any transformation that we wish in the
look-up table. We have not yet discovered a computationally
cheap way of optimising the network in order
to construct a Q(x) that best approximates the required
P(x). Instead, we optimise the network in such
a way that each layer could be used to reconstruct approximately
the state of the previous layer. This is not
the same optimisation problem, but it is computationally
very cheap, and empirically it leads to useful results for
Q(x). We choose to train the network as a multilayer topographic
mapping, which we implement in a look-up table
after the training schedule has ended. Typically, the
largest number of bits per pixel that we use is 8, which
corresponds to a look-up table with 65536 (= $2^{2 \times 8}$) separate
addresses, each containing an 8 bit output value.
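
The look-up table stage can be sketched as follows for 8 bit pixels. This is an illustration only: the table contents here are a hypothetical stand-in (the average of the two pixels) for a trained topographic mapping, and the function name, image size, and handling of the image edge are assumptions.

```python
import numpy as np

BITS = 8              # bits per pixel, as in the text
N_VALUES = 2 ** BITS

# Hypothetical stand-in for a trained mapping: average the two pixel values.
# A real table would hold the topographic mapping learnt during training.
upper, lower = np.meshgrid(np.arange(N_VALUES), np.arange(N_VALUES), indexing="ij")
lut = ((upper + lower) // 2).astype(np.uint8)  # 256 x 256 = 65536 addresses

def north_south_layer(image, table, separation=1):
    """Map every (pixel, pixel `separation` rows below it) pair through the table."""
    north = image[:-separation, :]
    south = image[separation:, :]
    return table[north, south]

rng = np.random.default_rng(4)
layer0 = rng.integers(0, N_VALUES, size=(16, 16), dtype=np.uint8)
layer1 = north_south_layer(layer0, lut)
```

The output has one row fewer than the input because the last row has no southern partner; this is the edge effect that the text ignores for large images. An east-west stage would slice along the second axis in the same way.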


When we have trained a sufficient number of layers, 
we may estimate the clique PDF’s in each layer. We 
simply record these as histograms, without making any 
attempt to interpolate or smooth these estimates; later 
on we shall mention a number of caveats. This completes 
the left-to-right pass in the top half of Figure 5. 


In order to construct our geometric mean estimate 
Qgm(x) of P(x), we must combine the estimates of the
clique PDFs. We may obtain the result in Equation 2.3 
by appropriately scaling and summing the logarithms of 
the histograms (and their marginal histograms) in Fig- 
ure 5. The method that we use depends on the following 
rearrangement of Equation 2.3 
log Qgm(x) =


in which we successively compute the contributions starting
at network layer n − 1, and then work outwards towards
layer 0. First of all we initialise all of the images
in the PDF estimation subsystem to some constant
value (say zero), and then commence at layer n − 1 (i.e.
the righthandmost layer in Figure 5). Using the notation
of Figure 2, each clique in the multilayer topographic
mapping subsystem contributes a term of the
form log Pout(y1, y2) − log Pout(y1) − log Pout(y2), which
we add to the values stored in the two pixels that are located
at the same clique position in the PDF estimation
subsystem. In order to compensate for this double counting,
and in order to account for the 1/2 factors that appear
in Equation 2.4, we scale the logarithmic value by a factor
1/4 (= 1/2 × 1/2). We then progress layer by layer towards
the left in Figure 5. At each layer we generate its logarithmic
contribution as above, but now we add to this the
contribution from the layer on its right, as shown in Figure 5
and Equation 2.4. By cascading the results backwards
from layer to layer of the network, we iteratively
construct log Qgm(x) in the form shown in Equation 2.4.
Note that the layer 0 cliques are slightly different, because
they contribute terms of the form log Pin12(x1, x2).


When all of these stages are complete, the output im- 
age in Figure 5 contains pixel values whose sum equals 
the required log Qgm(x). The contribution to log Qgm(x) 
that is recorded in an output pixel derives from a (rect- 
angular) region in the input image that surrounds the 
location of the output pixel, so the output image can be 
interpreted as an image of correctly spatially registered 
logarithmic probability contributions to log Qgm(x).


In our simulations we investigate how each individual
layer of the multilayer network contributes to log Qgm(x),
so we switch off all except one of the sources of logarithmic
probability in Figure 5, which permits only a single
layer of the network to contribute to the construction
of the output image. Because each layer of the network
typically is sensitive to statistical structure in the input
image at only one length scale, the output image then
typically reveals contributions to log Qgm(x) at only one
length scale.


We should remark in passing that there are many other 
possible ways in which Figure 5 could be configured. Our 
results depend on an underlying tree-like structure, which 
we replicate to produce the translation invariant network 
in Figure 4, which we then use directly to produce the 
design in Figure 5. In the case of a non-binary tree we 
must be careful to produce the correct generalisation of 
Figure 4 and Figure 5, but there are no new difficulties 
in principle. 


\cdots \frac{1}{2} \sum_{k_{n-3}} \log P_{n-3}(\cdots) \; \cdots \; \frac{1}{2} \sum_{k_{n-2}} \log P_{n-2}(\cdots) \; \cdots \; \frac{1}{2} \sum_{k_{n-1}} \log P_{n-1}(\cdots)


D. Algorithmic Details 


We compensate for some of the effects of non-uniform 
illumination of the scene in the input image by adding 
a grey scale wedge whose gradient we choose in such 
a way as to remove the linear component of the non- 
uniformity. This improves the assumed translation in- 
variance of the image statistics. We do not attempt to 
perform a histogram equalisation on the input image, be- 
cause the transformation from network layer 0 to layer 1 
tends to perform this function anyway. In order not to 
disrupt the discussion, we review the details of the topo- 
graphic mapping training algorithm in the appendix. 
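
One way to choose the gradient of the grey scale wedge is a least-squares plane fit. The following is a minimal sketch under that assumption (the text does not state how the gradient is chosen); the function name `remove_linear_illumination` is hypothetical:

```python
import numpy as np

def remove_linear_illumination(image):
    """Add a zero-mean grey scale wedge that cancels the fitted linear trend."""
    rows, cols = np.indices(image.shape)
    # Centre the coordinates so that the wedge itself has zero mean.
    r = rows - rows.mean()
    c = cols - cols.mean()
    design = np.column_stack([c.ravel(), r.ravel(), np.ones(image.size)])
    coeffs, *_ = np.linalg.lstsq(design, image.ravel().astype(float), rcond=None)
    wedge = -(coeffs[0] * c + coeffs[1] * r)   # opposite gradient to the fit
    return image + wedge

# Synthetic check: a constant grey level plus a linear illumination ramp.
rows, cols = np.indices((64, 64))
ramped = 100.0 + 0.5 * cols - 0.25 * rows
flattened = remove_linear_illumination(ramped)
```

Because the wedge has zero mean, the average grey level of the image is unchanged; only the linear component of the non-uniformity is removed.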

We choose to process the image in alternate directions 
using the following sequence: north/south, east/west,
north/south, east/west, etc. This sequence leads to the
following sequence of rectangular regions of the input image
that influence the value in each pixel in each layer of
the network: 1×2, 2×2, 2×4, 4×4, etc., using (east/west,
north/south) coordinates. In all of our experiments we 
use a 6-layer network, so the value in each pixel in the 
final layer is sensitive to an 8 x 8 region of the input 
image. 
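
The sequence of region sizes can be reproduced with a few lines of code. This is a small sketch in which the function name `receptive_fields` is hypothetical; it assumes only the alternation described above, with odd layers pairing pixels north/south and even layers pairing them east/west:

```python
def receptive_fields(n_layers):
    """Return (east/west, north/south) extents of the input region seen by each layer."""
    ew, ns = 1, 1
    fields = []
    for layer in range(1, n_layers + 1):
        if layer % 2 == 1:
            ns *= 2   # odd layers pair pixels in the north/south direction
        else:
            ew *= 2   # even layers pair pixels in the east/west direction
        fields.append((ew, ns))
    return fields

fields = receptive_fields(6)
for layer, (ew, ns) in enumerate(fields, start=1):
    print(f"layer {layer}: {ew} x {ns}")
```

For a 6-layer network the last entry is 8 x 8, which is the region size quoted above.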

The number of bits per pixel B that we use in each 
layer of the network determines the quality of the topo- 
graphic mappings (the B bit output from a topographic 
mapping is the index of the winner from amongst 2^B competing
“neurons”). Increasing B improves the quality of
the mapping but increases the training time; we need to 
compromise between these two conflicting requirements. 
In our work on simple Brodatz texture images we find 
that choosing B to lie between 6 and 8 proves to be suf- 
ficient. Note that we choose to use the same number of 
bits per pixel in each layer of the network. In general 
this restriction is not necessary. 

It is important to note that for a given value of B there 
is an upper limit on the allowed entropy (per unit area) 
that the input data should have. A hierarchically con- 
nected multilayer topographic mapping network progres- 
sively squeezes the input data through an ever smaller 
bottleneck (in fact there are multiple parallel bottlenecks 
due to the overlapping tree structure) as we pass through 
the layers of the network. There is an upper limit to the
number of network layers beyond which it simply cannot 
preserve information that is useful in estimating the joint 
density of the input data, which limits the capabilities of 
our current method. 

The choice of the size of the histogram bins is also im- 
portant. A property of the multilayer topographic map- 
ping network is that adjacent histogram bins derive from 
input vectors that are close to each other (in the Euclidean
sense), so it makes sense to rebin the histogram
by adding together the contents of adjacent bins. We
may easily control the histogram bin size by truncating
the low order bits of each pixel value. If we truncate
b low order bits of each pixel value, then effectively we
smooth the histogram over 2^b adjacent bins (for each dimension
of the histogram). As we smooth the histogram
it will suffer from less noise, but we run the danger of 
smoothing away significant structure that might usefully 
be used to characterise the statistics of the input image; 
so we need to make a compromise. 
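
The equivalence between truncating low order bits and summing adjacent bins can be checked directly. The sketch below uses random stand-in codes rather than real network outputs, and the function name `clique_histogram` is hypothetical:

```python
import numpy as np

def clique_histogram(y1, y2, bits, truncate=0):
    """2-D histogram of code pairs after dropping `truncate` low order bits."""
    a = np.asarray(y1, dtype=np.int64) >> truncate
    b = np.asarray(y2, dtype=np.int64) >> truncate
    size = 2 ** (bits - truncate)
    hist = np.zeros((size, size), dtype=np.int64)
    np.add.at(hist, (a.ravel(), b.ravel()), 1)   # unbuffered add handles repeated pairs
    return hist

rng = np.random.default_rng(0)
y1 = rng.integers(0, 256, size=1000)   # stand-in 8 bit codes
y2 = rng.integers(0, 256, size=1000)

fine = clique_histogram(y1, y2, bits=8)                # 256 x 256 bins
coarse = clique_histogram(y1, y2, bits=8, truncate=2)  # 64 x 64 bins

# Truncating b bits equals summing 2**b x 2**b blocks of the fine histogram.
blocks = fine.reshape(64, 4, 64, 4).sum(axis=(1, 3))
```

With 8 bit codes, truncating 2 bits gives the 6 bits per pixel histogramming used in the experiments of Section III.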

It is most important not to use histogram bins that 
are too small. A large number of small histogram bins 
would record the details of the statistical fluctuations of 
the training image (as particular realisations of a Poisson 
noise process in each bin), and would act as a detailed 
record of the structure in the training image, and thus 
be unable to generalise very well. Such histograms would 
look very spiky, and in extreme cases there might be 
counts recorded in only a few bins, with zeros in all of the 
remaining bins. If this situation were to occur, then the 
training image would have a large Qgm(x), whereas a test
image (having the same statistical properties) would have
a small Qgm(x). The cause of this problem is the absence
of a significant overlap between the spikes in the train- 
ing and test image histograms, which could be avoided 
by ensuring that the histogram bins are not too small. 
Generally, we find that a little experimentation can be 
used to determine a robust histogram binning strategy, 
so we do not attempt to implement a more sophisticated 
technique here. 

Finally, we display the contributions to log Qgm(x) as
follows. We determine the range of pixel values that oc- 
curs in the image, and we translate and scale this into the 
range [0,255]. This ensures that the smallest logarithmic 
probability appears as black, and the largest logarithmic 
probability appears as white, and all other values are 
linearly scaled onto intermediate levels of grey. This pre- 
scription has its dangers because each image determines 
its own special scaling, so one should be careful when 
comparing two different images. It can also be adversely 
affected by pixel value outliers arising from Poisson noise 
effects, where an extreme value of a single pixel could af- 
fect the way in which the whole of an image is displayed. 
However, we find that the overlapping tree structure of 
our multilayer network causes enough averaging together 
of individual contributions Q(x) that the composite result
Qgm(x) does not suffer from problems due to pixel
value outliers. 
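
The display prescription amounts to a linear map of the observed range onto [0,255]. A minimal sketch follows; the function name `to_grey_levels` and the optional `invert` flag are hypothetical conveniences, the latter covering the case in which white is to mark the smallest contribution:

```python
import numpy as np

def to_grey_levels(log_prob, invert=False):
    """Linearly map an array of log-probability contributions onto 0..255."""
    lo, hi = float(log_prob.min()), float(log_prob.max())
    if hi == lo:
        return np.zeros(log_prob.shape, dtype=np.uint8)   # flat image: nothing to scale
    scaled = (log_prob - lo) / (hi - lo) * 255.0
    if invert:
        scaled = 255.0 - scaled    # white then marks the smallest contribution
    return np.round(scaled).astype(np.uint8)

contrib = np.array([[-9.0, -6.0], [-3.0, 0.0]])
grey = to_grey_levels(contrib)
anomaly = to_grey_levels(contrib, invert=True)
```

Because `lo` and `hi` are taken from the image itself, a single outlying pixel changes the scaling of every other pixel, which is the danger noted above.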


III. APPLICATION TO THE DETECTION OF 
ANOMALIES IN TEXTURES 


In this section we present the results of applying the 
system shown in Figure 5 to four 256 x 256 images of 
textures taken from the Brodatz texture album [2]. In 
all cases we compensate for uneven illumination by introducing
a grey scale wedge as we explained earlier, we
use 8 bits per pixel for the topographic mappings, we
use 6 bits per pixel for histogramming, and we invert the
[0, 255] scale to represent the contributions to log Qgm(x)
in such a way that white pixels indicate a small (rather
than a large) contribution to log Qgm(x). Thus white
pixels in the output image correspond to regions of the
input image whose statistical properties differ markedly
from the statistics averaged over the whole image. We
usually call this representation of the contributions to
log Qgm(x) an “anomaly image”.

We do not present these results as necessarily being an 
efficient way of detecting texture anomalies. Rather, we 
merely apply our novel method of estimating PDF’s, as 
expressed in Equation 2.3 and in Figure 5, to the particular
problem of texture analysis, because this is an effective
way of demonstrating some of the more interesting
properties of log Qgm(x).


A. Texture 1 


In Figure 6 we show the first Brodatz texture image 
that we use in our experiments. The image is slightly 
unevenly illuminated and has a fairly low contrast, but 
nevertheless its statistical properties are almost transla- 
tion invariant. 


Figure 6: 256x256 image of Brodatz image 1. 


In Figure 7 we show the anomaly images that derive 
from Figure 6. Note how the anomaly images become 
smoother as we progress from Figure 7a to Figure 7f, 
due to the increasing amount of averaging that occurs 
amongst the overlapping trees in the network. 

Figure 7e and Figure 7f reveal a highly localised 
anomaly in the original image. Figure 7f corresponds 
to a length scale of 8 x 8 pixels, which is the approxi- 
mate size of the fault that is about ; of the way down 
Figure 7: 256x256 anomaly images of Brodatz image 1.


and slightly to the left of centre of Figure 6. The fault 
does not show up clearly on the other figures in Figure 
7 because their characteristic length scales are either too 
short or too long to be sensitive to the fault. 

From Figure 7 we conclude that ACE can easily pick 
out localised faults in highly ordered textures. 


B. Texture 2 


Figure 8: 256x256 image of Brodatz image 2. 


In Figure 8 we show the second Brodatz texture image 
that we use in our experiments. The image has a high 
contrast and translation invariant statistical properties. 


Figure 9: 256x256 anomaly images of Brodatz image 2. 


In Figure 9 we show the anomaly images that derive 
from Figure 8. The most interesting anomaly image is 
Figure 9f which shows several localised anomalies. About 
halfway down and to the left of centre of the image is an 
anomaly that corresponds to a dark spot on the thread 
in Figure 8. The brightest of the anomalies in the cluster 
just above the centre of the image corresponds to what 
appears to be a slightly torn thread in Figure 8. The 
other anomalies in this cluster are weaker, and corre- 
spond to slight distortions of the threads. There is an- 
other anomaly just below and to the right of the centre 
of Figure 9f, which corresponds to what appears to be 
another slightly torn thread in Figure 8. These anoma- 
lies all occur at, or around, a length scale of 8 x 8 pixels. 
Several of the anomaly images show an anomaly in the 
bottom left hand corner of the image, which corresponds 
to a small uniform patch of fabric in Figure 8. 


The results in Figure 9 corroborate the evidence in 
Figure 7 that we can train ACE to pick out localised 
anomalies in highly structured textures. This type of tex- 
ture could be analysed much more simply by model-based 
techniques that took advantage of their near-periodicity. 
However, that does not detract from the fact that, by 
making use of its adaptability, ACE succeeds in mod- 
elling these textures without prior knowledge of their 
near-periodicity. We seek a general purpose approach 
to density estimation; not a toolkit of different (usually 
model-based) techniques, each tuned to its own type of 
problem. 
Figure 10: 256x256 image of Brodatz image 3.


C. Texture 3 


In Figure 10 we show the third Brodatz texture image
that we use in our experiments. The image has a very
high contrast and statistical properties that are almost
translation invariant. However, the density of anomalies
is much higher than in either Figure 6 or Figure 8.


Figure 11: 256x256 anomaly images of Brodatz image 3.


In Figure 11 we show the anomaly images that derive 
from Figure 10. At the lower left hand corner of Figure 
11f there is a large anomaly that corresponds to a region 
of Figure 10 that is distorted to the left. Figure 11f is 
sensitive to a length scale of 8 x 8 pixels, so it does not 
respond to this leftward distortion (which occurs on a 
length scale of around 32 x 32 pixels); rather, it responds
to localised variations in the separations of the threads. 
There are numerous other anomalies in Figure 10; some 
are detected in Figure 11, and some are not. The ability 
of ACE to pick out anomalies degrades as the density 
of anomalies increases. This is because the anomalies 
themselves are part of the statistical properties that are 
extracted by ACE from the training image, and if a par- 
ticular type of anomaly occurs often enough in the image 
then it is no longer deemed to be an anomaly. In extreme 
cases there is also the possibility that the entropy (per 
unit area) of the input image can saturate ACE and thus 
degrade its performance, as we discussed earlier. 


D. Texture 4 


In this section we present a slightly different type of 
experiment in which we train ACE on one image and 
test ACE on another image. To create the two images 
we start with a single 256 x 256 image of a Brodatz tex- 
ture, which we divide into a left half and a right half. 
We then use the left half to construct a training image, 
and the right half to construct a test image. Note that 
in constructing these images we scrupulously avoid the 
possibility that the training and test images contain ele- 
ments deriving from a common source, although there are 
some small residual correlations between the two images 
along their common edge. 


Figure 12: 256x256 image of a Brodatz image of a carpet for 
training. 


In Figure 12 we show the training image which is a 
montage of two copies of the left hand half of a Brodatz 
texture image. Note that we use square, rather than 
rectangular, images because our software is restricted to 
processing this type of image. 

In Figure 13 we show the test image which is a montage 
of two copies of the right hand half of a Brodatz texture 
Figure 13: 256x256 image of a Brodatz image of a carpet for
testing. 


image, and superimposed on that is a 64 x 64 patch which 
we generated by flipping the rows and columns of a copy 
of the top left hand corner of this image. This patch is a 
hand-crafted anomaly. 




Figure 14: 256x256 anomaly images of a Brodatz image of a 
carpet. 


In Figure 14 we show the anomaly images that derive 
from Figure 13 after we train on Figure 12. Figure 14f 
shows the strongest response to the anomalous patch in 
the centre of the image, corresponding to anomaly detec- 
tion on a length scale of 8 x 8 pixels. 


IV. CONCLUSIONS 


We present a novel method of density estimation in 
high-dimensional spaces, such as images. In Bayesian 
data processing there is a pressing need for a flexible way 
of constructing such estimates, because the basic objects 
that we manipulate in Bayesian analysis are joint PDFs, 
which we must somehow construct in the first place. We 
call the hierarchical network structure that emerges from 
our analysis an “Adaptive Cluster Expansion”, or ACE for 
short. 


ACE is computationally very cheap: we can train a 
multilayer topographic mapping network to estimate the 
joint PDF of its input data at the rate of 1 network layer 
every 2.3 seconds (on a VAXstation 3100, and assuming
6 bits per pixel), where each layer analyses one length 
scale (power of 2) in the input data. We find, in our 
experiments with Brodatz textures, that 6 network layers
allow the detection of statistical anomalies in the
textures. This result is not universal, because it must 
depend strongly on the scale at which the anomalous sta- 
tistical structure in the data is to be found. Although we 
demonstrate ACE only in a texture anomaly detection 
role, its scope is far greater than this. ACE is a general 
purpose, and computationally cheap, network for esti- 
mating densities in high-dimensional spaces. 


For completeness, we should mention that the perfor- 
mance of ACE in its current form has two fundamental 
limitations. Firstly, we assume that the network connec- 
tivity is fixed, and that its functionality is determined by 
a training algorithm. This restricts the possible statisti- 
cal properties of the input data that could be estimated. 
Secondly, ACE is based upon a hierarchically connected 
multilayer topographic mapping network, which progres- 
sively squeezes the input data through an ever smaller 
bottleneck as we pass through the layers of the network. 
There is an upper limit to the number of network layers
beyond which ACE simply cannot preserve information 
that is useful in estimating the statistics of the input 
data. For instance, the statistics of an extremely noisy 
image of a texture cannot be successfully estimated by
ACE, because the noise entropy would saturate ACE be- 
fore the statistics of the underlying texture could be in- 
vestigated. This problem can be solved by introducing 
explicit noise models into ACE, which we shall report 
elsewhere. 


The standard topographic mapping training procedure 
in [6] is a rather inefficient algorithm. In [8] we present 
in detail an efficient training procedure for topographic 
mappings, and explain how to use it to train multilayer 
topographic mappings. 


For convenience, we introduce some notation. 
x = input vector

y = index of the winning “neuron”

x(y) = reference vector associated with y

π(y’ − y) = topographic neighbourhood function
(normalised to unit total mass)

ε = update parameter used during training

N = number of reference vectors


1. Standard Topographic Mapping Training 
Algorithm 


The standard topographic mapping training procedure 
is essentially as follows [6]: 


1. Select a training vector x at random from the train- 
ing set. 


2. Map x to y by using a nearest neighbour prescription
applied to the distance of x from each of the
current set of reference vectors.


3. For all y’, move the reference vector x(y’) directly
towards the input vector x by a distance
ε π(y’ − y) ||x − x(y’)||.


4. Go to step 1.


Repeat this loop as often as is required to ensure conver- 
gence of the reference vectors. 
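
The loop above can be sketched as follows for a 1-dimensional chain of “neurons”. The Gaussian neighbourhood (scaled to unit peak rather than unit total mass), the linear decay schedules, and the name `train_standard` are all assumptions made for illustration; the text requires only an even unimodal π(y’ − y) whose width shrinks, and a steadily decreasing ε:

```python
import numpy as np

def train_standard(data, n_refs, n_steps, rng, eps0=0.5, width0=None):
    """1-D topographic mapping trained with a shrinking Gaussian neighbourhood."""
    data = np.asarray(data, dtype=float)
    refs = data[rng.integers(0, len(data), size=n_refs)].copy()
    width0 = n_refs / 2.0 if width0 is None else width0
    index = np.arange(n_refs)
    for step in range(n_steps):
        frac = 1.0 - step / n_steps             # decays linearly from 1 towards 0
        eps = eps0 * frac                       # shrinking update step
        width = max(width0 * frac, 0.5)         # shrinking neighbourhood width
        x = data[rng.integers(0, len(data))]    # step 1: random training vector
        y = np.argmin(np.sum((refs - x) ** 2, axis=1))   # step 2: nearest neighbour
        pi = np.exp(-0.5 * ((index - y) / width) ** 2)   # even, unimodal in y' - y
        refs += eps * pi[:, None] * (x - refs)  # step 3: move x(y') towards x
    return refs

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=(2000, 2))
refs = train_standard(data, n_refs=16, n_steps=5000, rng=rng)
```

Every update moves a reference vector part of the way towards a training vector, so the reference vectors remain inside the convex hull of the training set.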

The standard training method specifies that π(y’ − y)
should be an even unimodal function whose width should 
be gradually decreased as training progresses. This al- 
lows coarse-grained organisation of the reference vectors 
to occur, followed progressively by ever more fine-grained 
organisation, until finally the algorithm converges to an 
optimum set of reference vectors. In a similar vein, the 
relative size of the update step ε should also be steadily
decreased as training progresses. 


2. Modified Topographic Mapping Training 
Algorithm 


In our own modification [8] of the standard topographic 
mapping training we replace a shrinking π(y’ − y) function
acting on a fixed number of reference vectors, by a
fixed π(y’ − y) function acting on an increasing number
of reference vectors. There are many minor variations on 
this theme, but we find that it is sufficient to define 


\pi(y' - y) = \begin{cases} \epsilon & y' = y \\ \epsilon' & |y' - y| = 1 \\ 0 & |y' - y| > 1 \end{cases}


where we absorb the ε into the definition of π(y’ − y). We
increase the number of reference vectors in a binary sequence
(i.e. N = 2, 4, 8, 16, 32, ···), and we initialise each
generation of reference vectors by interpolation from the
previous generation. We find that the following parameter
values yield adequate convergence: ε = 0.1, ε’ = 0.05,
and we perform 20N training updates before doubling 
the value of N, as above. We initialise the N = 2 pair 
of reference vectors as a random pair of vectors chosen 
from the training set. 
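
A minimal sketch of this modified procedure for a 1-dimensional chain of reference vectors is given below. The linear interpolation scheme in `interpolate_refs` is an assumption (the text says only that each generation is initialised by interpolation), and both function names are hypothetical:

```python
import numpy as np

def interpolate_refs(refs):
    """Double a 1-D chain of reference vectors by linear interpolation (assumed scheme)."""
    n = len(refs)
    old_pos = np.arange(n)
    new_pos = np.linspace(0.0, n - 1.0, 2 * n)
    return np.column_stack([np.interp(new_pos, old_pos, refs[:, d])
                            for d in range(refs.shape[1])])

def train_modified(data, n_final, rng, eps=0.1, eps_prime=0.05, updates_per_ref=20):
    """Fixed neighbourhood function acting on a doubling number of reference vectors."""
    data = np.asarray(data, dtype=float)
    refs = data[rng.choice(len(data), size=2, replace=False)].copy()   # N = 2 start
    while True:
        n = len(refs)
        for _ in range(updates_per_ref * n):          # 20N updates per generation
            x = data[rng.integers(0, len(data))]
            y = int(np.argmin(np.sum((refs - x) ** 2, axis=1)))
            refs[y] += eps * (x - refs[y])            # winner: pi = eps
            for nb in (y - 1, y + 1):                 # |y' - y| = 1: pi = eps'
                if 0 <= nb < n:
                    refs[nb] += eps_prime * (x - refs[nb])
        if n >= n_final:
            return refs
        refs = interpolate_refs(refs)                 # N -> 2N

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=(2000, 2))
refs = train_modified(data, n_final=16, rng=rng)      # N = 2, 4, 8, 16
```

Here `n_final` is assumed to be a power of 2, so that the doubling sequence lands on it exactly.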

In numerous experiments, we find that this modified 
form of the topographic mapping training algorithm con- 
verges much more rapidly than the standard method. 
Furthermore, the binary sequence of N values lends itself 
well to implementing the trained topographic mapping 
using a look-up table. 
A Self-Organising Neural Network for Processing Data from Multiple Sensors


* 


S P Luttrell 
Defence Research Agency, St Andrews Road, Malvern, Worcs, WR14 3PS, United Kingdom 


This paper shows how a folded Markov chain network can be applied to the problem of processing 
data from multiple sensors, with an emphasis on the special case of 2 sensors. It is necessary to design 
the network so that it can transform a high dimensional input vector into a posterior probability, 
for which purpose the partitioned mixture distribution network is ideally suited. The underlying 
theory is presented in detail, and a simple numerical simulation is given that shows the emergence
of ocular dominance stripes.


I. THEORY 


A. Neural Network Model 


In order to fix ideas, it is useful to give an explicit “neu- 
ral network” interpretation to the theory that will be de- 
veloped. The model will consist of 2 layers of nodes. The 
input layer has a “pattern of activity” that represents the 
components of the input vector x, and the output layer 
has a pattern of activity that is the collection of activ- 
ities of each output node. The activities in the output 
layer depend only on the activities in the input layer. If 
an input vector x is presented to this network, then each 
output node “fires” discretely at a rate that corresponds 
to its activity. After n nodes have fired, the probabilistic
description of the relationship between the input and
output of the network is given by Pr (y1, y2, ···, yn|x),
where yi is the location in the output layer (assumed to
be on a rectangular lattice of size m) of the i-th node that
fires. In this paper it will be assumed that the order
in which the n nodes fire is not observed, in which case
Pr (y1, y2, ···, yn|x) is a sum of probabilities over all n!
permutations of (y1, y2, ···, yn), which is a symmetric
function of the yi, by construction.

The theory that is introduced in section I B concerns
the special case n = 1. In the n = 1 case the probabilistic
description Pr (y|x) is proportional to the firing
rate of node y in response to input x. When n > 1 there
is an indirect relationship between the probabilistic description
Pr (y1, y2, ···, yn|x) and the firing rate of node
y, which is given by the marginal probability


\Pr(y|\mathbf{x}) = \sum_{y_2, \cdots, y_n = 1}^{m} \Pr(y, y_2, \cdots, y_n|\mathbf{x})    (1.1)


It is important to maintain this distinction between
events that are observed (i.e. (y1, y2, ···, yn) given
x) and the probabilistic description of the events that
are observed (i.e. Pr (y1, y2, ···, yn|x)). The only
possible exception is in the n → ∞ limit, where
Pr (y1, y2, ···, yn|x) has all of its probability concentrated
in the vicinity of those (y1, y2, ···, yn) that are
consistent with the observed long-term average firing rate
of each node. It is essential to consider the n > 1 case to
obtain the results that are described in this paper.


B. Probabilistic Encoder/Decoder


A theory of self-organising networks based on an anal- 
ysis of a probabilistic encoder/decoder was presented in 
[1]. It deals with the n = 1 case referred to in section I A. 
The objective function that needs to be minimised in or- 
der to optimise a network in this theory is the Euclidean 
distortion D defined as 


D = \sum_{y=1}^{m} \int d\mathbf{x}\, d\mathbf{x}' \Pr(\mathbf{x}) \Pr(y|\mathbf{x}) \Pr(\mathbf{x}'|y) \|\mathbf{x} - \mathbf{x}'\|^2    (1.2)
where x is an input vector, y is a coded version of x 
(a vector index on a d-dimensional rectangular lattice of 
size m), x’ is a reconstructed version of x from y, Pr (x) 
is the probability density of input vectors, Pr(y|x) is 
a probabilistic encoder, and Pr (x’|y) is a probabilistic 
decoder which is specified by Bayes’ theorem as 


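\Pr(\mathbf{x}|y) = \frac{\Pr(y|\mathbf{x}) \Pr(\mathbf{x})}{\int d\mathbf{x}' \Pr(y|\mathbf{x}') \Pr(\mathbf{x}')}    (1.3)


D can be rearranged into the form [1]

D = 2 \sum_{y=1}^{m} \int d\mathbf{x} \Pr(\mathbf{x}) \Pr(y|\mathbf{x}) \|\mathbf{x} - \mathbf{x}'(y)\|^2    (1.4)


where the reference vectors x’ (y) are defined as


\mathbf{x}'(y) = \int d\mathbf{x} \Pr(\mathbf{x}|y)\, \mathbf{x}    (1.5)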
Although equation 1.2 is symmetric with respect to in- 
terchanging the encoder and decoder, equation 1.4 is not. 
This is because Bayes’ theorem has made explicit the de- 
pendence of Pr (x|y) on Pr (y|x). From a neural network 
viewpoint Pr(y|x) describes the feed-forward transfor- 
mation from the input layer to the output layer, and 
x’ (y) describes the feed-back transformation that is implied
from the output layer to the input layer. The feed-back
transformation is necessary to implement the objective
function that has been chosen here.

Minimisation of D with respect to all free parameters 
leads to an optimal encoder/decoder. In equation 1.4 the 
Pr (y|x) are the only free parameters, because x’ (y) is 
fixed by equation 1.5. However, in practice, both Pr (y|x) 
and x’ (y) may be treated as free parameters [1], because 
x’ (y) satisfy equation 1.5 at stationary points of D with 
respect to variation of x’ (y). 
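
The equivalence of equations 1.2 and 1.4 can be checked numerically for a discrete input distribution, in which case the integrals become sums. The following sketch uses arbitrary random values for Pr (x) and Pr (y|x); all variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, dim, m = 6, 2, 3
xs = rng.normal(size=(n_x, dim))                      # support points of a discrete Pr(x)
p_x = rng.dirichlet(np.ones(n_x))                     # Pr(x)
p_y_given_x = rng.dirichlet(np.ones(m), size=n_x)     # encoder Pr(y|x), rows sum to 1

joint = p_x[:, None] * p_y_given_x                    # Pr(x) Pr(y|x)
p_y = joint.sum(axis=0)
p_x_given_y = joint / p_y                             # decoder from Bayes' theorem (1.3)

# Equation 1.2: sum over y, x, x' of Pr(x) Pr(y|x) Pr(x'|y) ||x - x'||^2
sq = np.sum((xs[:, None, :] - xs[None, :, :]) ** 2, axis=2)
d_full = np.einsum('xy,zy,xz->', joint, p_x_given_y, sq)

# Equations 1.4 and 1.5: reference vectors and the rearranged form
x_ref = p_x_given_y.T @ xs                            # x'(y) = sum_x Pr(x|y) x
sq_ref = np.sum((xs[:, None, :] - x_ref[None, :, :]) ** 2, axis=2)
d_rearranged = 2.0 * np.sum(joint * sq_ref)

print(d_full, d_rearranged)
```

The two values agree to rounding error, and the factor of 2 in equation 1.4 arises because x and x’ are independent samples from the same Pr (x|y) once y is given.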


C. Posterior Probability Model 


The probabilistic encoder/decoder requires an explicit 
functional form for the posterior probability Pr (y|x). A 
convenient expression is 


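\Pr(y|\mathbf{x}) = \frac{Q(\mathbf{x}|y)}{\sum_{y'=1}^{m} Q(\mathbf{x}|y')}    (1.6)


where Q (x|y) ≥ 0 can be regarded as a node “activity”,
and \sum_{y=1}^{m} \Pr(y|\mathbf{x}) = 1. Any non-negative function can
be used for Q (x|y), such as a sigmoid (which satisfies
0 ≤ Q (x|y) ≤ 1)


Q(\mathbf{x}|y) = \frac{1}{1 + \exp(-\mathbf{w}(y) \cdot \mathbf{x} - b(y))}    (1.7)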


where w(y) and b(y) are a weight vector and bias, re- 
spectively. 

A drawback to the use of equation 1.6 is that it does
not scale well to input vectors that have a
large dimensionality. This problem arises from the restricted
functional form allowed for Q (x|y). A solution
was presented in [2]


\Pr(y|\mathbf{x}) = \frac{1}{M} Q(\mathbf{x}|y) \sum_{y' \in \tilde{N}(y)} \frac{1}{\sum_{y'' \in N(y')} Q(\mathbf{x}|y'')}    (1.8)


where M = m1 m2 ··· md, and N (y) is a set of lattice
points that are deemed to be “in the neighbourhood of”
the lattice point y, and Ñ (y) is the inverse neighbourhood
defined as the set of lattice points that have lattice
point y in their neighbourhood. This expression for
Pr (y|x) satisfies \sum_{y=1}^{m} \Pr(y|\mathbf{x}) = 1 (see appendix A). It
is convenient to define


\Pr(y|\mathbf{x}; y') = \frac{Q(\mathbf{x}|y)}{\sum_{y'' \in N(y')} Q(\mathbf{x}|y'')}    (1.9)


which is another posterior probability, by construction.
It includes the effect of the output nodes that are in the
neighbourhood of node y’ only. Pr (y|x; y’) is thus a
localised posterior probability derived from a localised
subset of the node activities. This allows equation 1.8
to be written as \Pr(y|\mathbf{x}) = \frac{1}{M} \sum_{y' \in \tilde{N}(y)} \Pr(y|\mathbf{x}; y'), so
Pr (y|x) is the average of the posterior probabilities at
node y arising from each of the localised subsets that
happens to include node y.


D. Multiple Firing Model


The model may be extended to the case where
n output nodes fire. Pr (y|x) is then replaced
by Pr (y1, y2, ···, yn|x), which is the probability that
(y1, y2, ···, yn) are the first n nodes to fire (in that order).
With this modification, D becomes


D = 2 \sum_{y_1, y_2, \cdots, y_n = 1}^{m} \int d\mathbf{x} \Pr(\mathbf{x}) \Pr(y_1, y_2, \cdots, y_n|\mathbf{x}) \|\mathbf{x} - \mathbf{x}'(y_1, y_2, \cdots, y_n)\|^2    (1.10)


where the reference vectors x’ (y1, y2, ···, yn) are defined
as


\mathbf{x}'(y_1, y_2, \cdots, y_n) = \int d\mathbf{x} \Pr(\mathbf{x}|y_1, y_2, \cdots, y_n)\, \mathbf{x}    (1.11)


The dependence of Pr (y1, y2, ···, yn|x) and
x’ (y1, y2, ···, yn) on n output node locations complicates
this result. Assume that Pr (y1, y2, ···, yn|x) is
a symmetric function of its (y1, y2, ···, yn) arguments,
which corresponds to ignoring the order in which the
first n nodes choose to fire (i.e. Pr (y1, y2, ···, yn|x) is
a sum over all permutations of (y1, y2, ···, yn)). For
simplicity, assume that the nodes fire independently so
that Pr (y1, y2|x) = Pr (y1|x) Pr (y2|x) (see appendix
B for the general case where Pr (y1, y2|x) does not
factorise). D may be shown to satisfy the inequality
D ≤ D1 + D2 (see appendix B), where


D_1 = \frac{2}{n} \sum_{y=1}^{m} \int d\mathbf{x} \Pr(\mathbf{x}) \Pr(y|\mathbf{x}) \|\mathbf{x} - \mathbf{x}'(y)\|^2
D_2 = \frac{2(n-1)}{n} \int d\mathbf{x} \Pr(\mathbf{x}) \left\| \sum_{y=1}^{m} \Pr(y|\mathbf{x}) (\mathbf{x} - \mathbf{x}'(y)) \right\|^2    (1.12)


D1 and D2 are both non-negative. D1 → 0 as n → ∞,
and D2 = 0 when n = 1, so the D1 term is the
sole contribution to the upper bound when n = 1, and
the D2 term provides the dominant contribution as n →
∞. The difference between the D1 and the D2 terms is
the location of the \sum_{y=1}^{m} \Pr(y|\mathbf{x}) (···) average: in the
D2 term it averages a vector quantity, whereas in the
D1 term it averages a Euclidean distance. The D2 term
will therefore exhibit interference effects, whereas the D1
term will not. 
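
The behaviour of the two terms as n varies can be illustrated for a discrete input distribution. This is a minimal sketch with arbitrary random values for Pr (y|x) and x’ (y); the function name `d1_d2` is hypothetical, and the integrals over x are replaced by sums:

```python
import numpy as np

def d1_d2(xs, p_x, p_y_given_x, x_ref, n):
    """D1 and D2 of equation 1.12 for a discrete Pr(x)."""
    diff = xs[:, None, :] - x_ref[None, :, :]               # x - x'(y), shape (x, y, dim)
    sq = np.sum(diff ** 2, axis=2)                          # squared Euclidean distances
    d1 = (2.0 / n) * np.sum(p_x[:, None] * p_y_given_x * sq)
    mean_err = np.einsum('xy,xyd->xd', p_y_given_x, diff)   # sum_y Pr(y|x) (x - x'(y))
    d2 = (2.0 * (n - 1) / n) * np.sum(p_x * np.sum(mean_err ** 2, axis=1))
    return d1, d2

rng = np.random.default_rng(1)
xs = rng.normal(size=(8, 2))
p_x = np.full(8, 1.0 / 8)
p_y_given_x = rng.dirichlet(np.ones(4), size=8)
x_ref = rng.normal(size=(4, 2))

for n in (1, 2, 10, 1000):
    d1, d2 = d1_d2(xs, p_x, p_y_given_x, x_ref, n)
    print(n, d1, d2)
```

D2 vanishes at n = 1 and D1 decays as 1/n. Because D2 averages the vector x − x’ (y) before taking the norm, errors of opposite sign can cancel, which is the interference effect referred to above.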


E. Probability Leakage 


The model may be further extended to the case where 
the probability that a node fires is a weighted average of 
the underlying probabilities that the nodes in its vicinity 
fire. Thus Pr (y|x) becomes 


\Pr(y|\mathbf{x}) \rightarrow \sum_{y'=1}^{m} \Pr(y|y') \Pr(y'|\mathbf{x})    (1.13)


where $\Pr(y|y')$ is the conditional probability that node $y$ fires given that node $y'$ would have liked to fire. In a sense, $\Pr(y|y')$ describes a “leakage” of probability from node $y'$ onto node $y$. $\Pr(y|y')$ then plays the role of a soft “neighbourhood function” for node $y'$. This expression for $\Pr(y|x)$ can be used wherever a plain $\Pr(y|x)$ has been used before. The main purpose of introducing leakage is to encourage neighbouring nodes to perform a similar function. This occurs because the effect of leakage is to soften the posterior probability $\Pr(y|x)$, and thus reduce the ability to reconstruct $x$ accurately from knowledge of $y$, which thus increases the average Euclidean distortion $D$. To reduce the damage that leakage causes, the optimisation must ensure that nodes that leak probability onto each other have similar properties, so that it does not matter much that they leak.


F. The Model


The focus of this paper is on minimisation of the upper bound $D_1 + D_2$ (see equation 1.12) on $D$ in the multiple firing model, using a scalable posterior probability $\Pr(y|x)$ (see equation 1.8), with the effect of activity leakage $\Pr(y|y')$ taken into account (see equation 1.13). Gathering all of these pieces together yields


(1.14)


where $\Pr(y|x;y') \equiv \frac{Q(x|y)}{\sum_{y'' \in N(y')} Q(x|y'')}$.

In order to ensure that the model is truly scalable, it is necessary to restrict the dimensionality of the reference vectors. In equation 1.14 $\dim x'(y) = \dim x$, which is not acceptable in a scalable network. In practice, it will be assumed that any properties of node $y$ that are vectors in input space will be limited to occupy an “input window” of restricted size that is centred on node $y$. This restriction applies to the node reference vector $x'(y)$, which prevents $D_1 + D_2$ from being fully minimised, because $x'(y)$ is allowed to move only in a subspace of the full-dimensional input space. However, useful results can nevertheless be obtained, so this restriction is acceptable.
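The leakage substitution of equation 1.13 is a matrix-vector product, which can be sketched as follows. The 3-node leakage matrix and the helper name `apply_leakage` are hypothetical and chosen only to make the softening effect visible; the only property relied upon is that each column of the matrix sums to unity.

```python
def apply_leakage(posterior, leak):
    """Return the leaked posterior sum_{y'} Pr(y|y') Pr(y'|x) for every node y.

    posterior : list of Pr(y'|x) before leakage
    leak      : leak[y][yp] plays the role of Pr(y|y'); every column yp must
                sum to 1 so that the leaked posterior remains normalised
    """
    m = len(posterior)
    return [sum(leak[y][yp] * posterior[yp] for yp in range(m)) for y in range(m)]


# Illustrative leakage between adjacent nodes (each column sums to 1).
leak = [
    [0.8, 0.1, 0.0],
    [0.2, 0.8, 0.2],
    [0.0, 0.1, 0.8],
]

posterior = [1.0, 0.0, 0.0]          # node 0 "would have liked to fire"
leaked = apply_leakage(posterior, leak)
```

In this toy case the sharply peaked posterior is spread onto the adjacent node while remaining normalised, which is the softening that the optimisation has to compensate for.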


G. Optimisation 


Optimisation is achieved by minimising $D_1 + D_2$ with respect to its free parameters. Thus the derivatives with respect to $x'(y)$ are given by

$$\frac{\partial D_1}{\partial x'(y)} = -\frac{4}{nM} \int dx\, \Pr(x)\, f_1(x,y)$$
$$\frac{\partial D_2}{\partial x'(y)} = -\frac{4(n-1)}{nM^2} \int dx\, \Pr(x)\, f_2(x,y) \qquad (1.15)$$

and the variations with respect to $Q(x|y)$ are given by

$$\delta D_1 = \frac{2}{nM} \int dx\, \Pr(x) \sum_{y=1}^{M} g_1(x,y)\, \delta \log Q(x|y)$$
$$\delta D_2 = \frac{4(n-1)}{nM^2} \int dx\, \Pr(x) \sum_{y=1}^{M} g_2(x,y)\, \delta \log Q(x|y) \qquad (1.16)$$

The functions $f_1(x,y)$, $f_2(x,y)$, $g_1(x,y)$, and $g_2(x,y)$ are derived in appendix C. Inserting a sigmoidal function

$$Q(x|y) = \frac{1}{1 + \exp\left(-w(y) \cdot x - b(y)\right)}$$

then yields the derivatives with respect to $w(y)$ and $b(y)$ as

$$\frac{\partial D_1}{\partial \begin{pmatrix} w(y) \\ b(y) \end{pmatrix}} = \frac{2}{nM} \int dx\, \Pr(x)\, g_1(x,y)\, \left(1 - Q(x|y)\right) \begin{pmatrix} x \\ 1 \end{pmatrix}$$
$$\frac{\partial D_2}{\partial \begin{pmatrix} w(y) \\ b(y) \end{pmatrix}} = \frac{4(n-1)}{nM^2} \int dx\, \Pr(x)\, g_2(x,y)\, \left(1 - Q(x|y)\right) \begin{pmatrix} x \\ 1 \end{pmatrix} \qquad (1.17)$$

Because all of the properties of node $y$ that are vectors in input space (i.e. $x'(y)$ and $w(y)$) are assumed to be restricted to an input window centred on node $y$, the eventual result of evaluating the right hand sides of the above equations must be similarly restricted to the same input window.


H. The Effect of the Euclidean Norm on Minimising $D_1 + D_2$


The expressions for $D_1$ and $D_2$, and especially their derivatives, are fairly complicated, so an intuitive interpretation will now be presented. When $D_1 + D_2$ is stationary with respect to variations of $x'(y)$ it may be written as (see appendix D)

$$D_1 + D_2 = -\frac{2}{n} \int dx\, \Pr(x) \sum_{y=1}^{M} \Pr(y|x) \left\| x'(y) \right\|^2 - \frac{2(n-1)}{n} \int dx\, \Pr(x) \left\| \sum_{y=1}^{M} \Pr(y|x)\, x'(y) \right\|^2 + \text{constant} \qquad (1.18)$$

The $M$ and $M^2$ factors do not appear in this expression because $\Pr(y|x)$ is normalised to sum to unity. The first term (which derives from $D_1$) is an incoherent sum (i.e. a sum of Euclidean distances), whereas the second term (which derives from $D_2$) is a coherent sum (i.e. a sum of vectors). The first term contributes for all values of $n$, whereas the second term contributes only for $n \geq 2$, and dominates for $n \gg 1$. In order to minimise the first term the $\|x'(y)\|^2$ like to be as large as possible for those nodes that have a large $\Pr(y|x)$. Since $x'(y)$ is the centroid of the probability density $\Pr(x|y)$, this implies that node $y$ prefers to encode a region of input space that is as far as possible from the origin. This is a consequence of using a Euclidean distortion measure $\|x - x'\|^2$, which has the dimensions of $\|x\|^2$, in the original definition of the distortion in equation 1.2. In order to minimise the second term the superposition of $x'(y)$ weighted by the $\Pr(y|x)$ likes to have as large a Euclidean norm as possible. Thus the nodes co-operate amongst themselves to ensure that the nodes that have a large $\Pr(y|x)$ also have a large $\left\| \sum_{y=1}^{M} \Pr(y|x)\, x'(y) \right\|^2$.

II. SOLVABLE ANALYTIC MODEL 


The purpose of this section is to work through a case study in order to demonstrate the various properties that emerge when $D_1 + D_2$ is minimised.


A. The Model 


It is convenient to begin by ignoring the effects of leakage $\Pr(y|y')$, and to concentrate on a simple (non-scaling) version of the posterior probability model (as in equation 1.6)

$$\Pr(y|x) = \frac{Q(x|y)}{\sum_{y'=1}^{M} Q(x|y')}$$

where the $Q(x|y)$ are threshold functions of $x$

$$Q(x|y) = \begin{cases} 0 & \text{below threshold} \\ 1 & \text{above threshold} \end{cases} \qquad (2.1)$$


It is also convenient to imagine that a hypothetical infinite-sized training set is available, so it may be described by a probability density $\Pr(x)$. This is a “frequentist”, rather than a “Bayesian”, use of the $\Pr(x)$ notation, but the distinction is not important in the context of this paper. Assume that $x = (x_1, x_2)$ is drawn from a training set that has 2 statistically independent subspaces, so that

$$\Pr(x_1, x_2) = \Pr(x_1)\, \Pr(x_2) \qquad (2.2)$$

Furthermore, assume that $\Pr(x_1)$ and $\Pr(x_2)$ each have the form


(2.3) 


i.e. $\Pr(x_i)$ is a loop (parameterised by a phase angle $\theta_i$) of probability density that sits in $x_i$-space. In order to make it easy to deduce the optimum reference vectors, choose $x_i(\theta_i)$ so that the following 2 conditions are satisfied for $i = 1, 2$

$$\left\| x_i(\theta_i) \right\|^2 = \text{constant}$$
$$\left\| \frac{\partial x_i(\theta_i)}{\partial \theta_i} \right\|^2 = \text{constant} \qquad (2.4)$$
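One simple way to realise such a training set is sketched below, under the assumption that each subvector is a point on a unit circle, $x_i(\theta_i) = (\cos\theta_i, \sin\theta_i)$. This choice satisfies both conditions in equation 2.4, but it is only one possibility; the function name `torus_sample` and the seed are hypothetical.

```python
import math
import random


def torus_sample(rng):
    """Draw one training vector (x1, x2) with two independent uniform phases.

    Each subvector is taken to be a unit circle, x_i = (cos theta_i, sin theta_i),
    so its norm and the norm of its derivative with respect to theta_i are both 1.
    """
    theta1 = rng.uniform(0.0, 2.0 * math.pi)
    theta2 = rng.uniform(0.0, 2.0 * math.pi)   # drawn independently of theta1
    x1 = (math.cos(theta1), math.sin(theta1))
    x2 = (math.cos(theta2), math.sin(theta2))
    return x1, x2


rng = random.Random(0)                          # fixed seed for repeatability
batch = [torus_sample(rng) for _ in range(1000)]
```

Because the two phases are drawn independently, the pairs $(x_1, x_2)$ cover the product of two circles with uniform density.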


This type of training set can be visualised topologically. Each training vector $(x_1, x_2)$ consists of 2 subvectors, each of which is parameterised by a phase angle, and which therefore lives in a subspace that has the topology of a circle, which is denoted as $S^1$. Because of the independence assumption in equation 2.2, the pair $(x_1, x_2)$ lives on the surface of a 2-torus, which is denoted as $S^1 \times S^1$. The minimisation of $D_1 + D_2$ thus reduces to finding the optimum way of designing an encoder/decoder for input vectors that live on a 2-torus, with the proviso that their probability density is uniform (this follows from equation 2.3 and equation 2.4).

In order to derive the reference vectors $x'(y)$, the solution(s) of the stationarity condition $\frac{\partial (D_1 + D_2)}{\partial x'(y)} = 0$ must be computed. The stationarity condition reduces to (see appendix D)

$$n \int dx_1\, dx_2\, \Pr(x_1, x_2|y)\, (x_1, x_2) = (n-1) \sum_{y'=1}^{M} \int dx_1\, dx_2\, \Pr(x_1, x_2|y)\, \Pr(y'|x_1, x_2)\, \left( x_1'(y'), x_2'(y') \right) + \left( x_1'(y), x_2'(y) \right) \qquad (2.5)$$

It is useful to use the simple diagrammatic notation shown in figure 1. Each circle in figure 1 represents one of the $S^1$ subspaces, so the two circles together represent the product $S^1 \times S^1$. The constraints in equation 2.4 are represented by each circle being centred on the origin of its subspace ($\|x_i(\theta_i)\|^2$ is constant), and the probability density around each circle being constant ($\left\| \frac{\partial x_i(\theta_i)}{\partial \theta_i} \right\|^2$ is constant). A single threshold function $Q(x|y)$ is represented by a chord cutting through each circle (with 0 and 1 indicating on which side of the chord the threshold is triggered). The $x_i$ that lie above threshold in each subspace are highlighted. Both $x_1$ and $x_2$ must lie above threshold in order to ensure $Q(x|y) = 1$, i.e. they must both lie within regions that are highlighted in figure 1. In this case node $y$ will be said to be “attached” to both subspace 1 and subspace 2. A special case arises when the chord in one of the subspaces (say it is $x_2$) does not intersect the circle at all, and the circle lies on the side of the chord where the threshold is triggered. In this case $Q(x|y)$ does not depend on $x_2$, so that $\Pr(y|x_1, x_2) = \Pr(y|x_1)$, in which case node $y$ will be said to be “attached” to subspace 1 but “detached” from subspace 2. The typical ways in which a node becomes attached to the 2-torus are shown in figure 2. In figure 2(a) the node is attached to one of the $S^1$ subspaces and detached from the other. In figure 2(b) the attached and detached subspaces are interchanged with respect to figure 2(a). In figure 2(c) the node is attached to both subspaces.

Figure 1: Representation of $S^1 \times S^1$ topology with a threshold $Q(x|y)$ superimposed.

Figure 2: Explicit representation of $S^1 \times S^1$ topology as a torus with the effect of 3 different types of threshold $Q(x|y)$ shown.

Figure 3: 16 nodes are shown, which are all attached to subspace 1, and all detached from subspace 2.


B. All Nodes Attached to One Subspace


Consider the configuration of threshold functions shown in figure 3. This is equivalent to all of the nodes being attached to loops to cover the 2-torus, with a typical node being as shown in figure 2(a) (or, equivalently, figure 2(b)). When $D_1 + D_2$ is minimised, it is assumed that the 4 nodes are symmetrically disposed in subspace 1, as shown. Each is triggered if and only if $x_1$ lies within its quadrant, and one such quadrant is highlighted in figure 3. This implies that only 1 node is triggered at a time. The assumed form of the threshold functions implies $\Pr(y|x_1, x_2) = \Pr(y|x_1)$, so equation 2.5 reduces to

$$n \int dx_1\, dx_2\, \Pr(x_1|y)\, \Pr(x_2)\, (x_1, x_2) = \int dx_1\, dx_2\, \Pr(x_1|y)\, \Pr(x_2) \left\{ (n-1) \sum_{y'=1}^{M} \Pr(y'|x_1) \left( x_1'(y'), x_2'(y') \right) + \left( x_1'(y), x_2'(y) \right) \right\} \qquad (2.6)$$

whence

$$x_1'(y) = \int dx_1\, \Pr(x_1|y)\, x_1$$
$$x_2'(y) = 0 \qquad (2.7)$$


C. All Nodes Attached to Both Subspaces


Consider the configuration of threshold functions shown in figure 4. This is equivalent to all of the nodes being attached to patches to cover the 2-torus, with a typical node being as shown in figure 2(c). In this case, when $D_1 + D_2$ is minimised, it is assumed that each subspace is split into 2 halves. This requires a total of 4 nodes, each of which is triggered if, and only if, both $x_1$ and $x_2$ lie on the corresponding half-circles. This implies that only 1 node is triggered at a time. The assumed form of the threshold functions implies that the stationarity condition becomes

$$n \Pr(y) \int dx_1\, dx_2\, \Pr(x_1, x_2|y)\, (x_1, x_2) = \Pr(y) \int dx_1\, dx_2\, \Pr(x_1, x_2|y) \left\{ (n-1) \sum_{y'=1}^{M} \Pr(y'|x_1, x_2) \left( x_1'(y'), x_2'(y') \right) + \left( x_1'(y), x_2'(y) \right) \right\} \qquad (2.8)$$

whence

$$x_1'(y) = \int dx_1\, \Pr(x_1|y)\, x_1$$
$$x_2'(y) = \int dx_2\, \Pr(x_2|y)\, x_2 \qquad (2.9)$$

Figure 4: 16 nodes are shown, which are all attached to both subspace 1 and subspace 2.


D. Half the Nodes Attached to One Subspace, and Half to the Other Subspace


Consider the configuration of threshold functions shown in figure 5. This is equivalent to half of the nodes being attached to loops to cover the 2-torus, with a typical node being as shown in figure 2(a). The other half of the nodes would then be attached in an analogous way, but as shown in figure 2(b). Thus the 2-torus is covered twice over. In this case, when $D_1 + D_2$ is minimised, it is assumed that each subspace is split into 2 halves. This requires a total of 4 nodes, each of which is triggered if $x_1$ (or $x_2$) lies on the half-circle in the subspace to which the node is attached. Thus exactly 2 nodes $y_1(x_1)$ and $y_2(x_2)$ are triggered at a time, so that

$$\Pr(y|x_1, x_2) = \frac{1}{2} \left( \delta_{y, y_1(x_1)} + \delta_{y, y_2(x_2)} \right) = \frac{1}{2} \left( \Pr(y|x_1) + \Pr(y|x_2) \right) \qquad (2.10)$$

For simplicity, assume that node $y$ is attached to subspace 1, then $\Pr(x_1, x_2|y) = \Pr(x_1|y)\, \Pr(x_2)$ and the stationarity condition becomes

$$n \Pr(y) \int dx_1\, dx_2\, \Pr(x_1|y)\, \Pr(x_2)\, (x_1, x_2) = \Pr(y) \int dx_1\, dx_2\, \Pr(x_1|y)\, \Pr(x_2) \left\{ \frac{n-1}{2} \sum_{y'=1}^{M} \left( \Pr(y'|x_1) + \Pr(y'|x_2) \right) \left( x_1'(y'), x_2'(y') \right) + \left( x_1'(y), x_2'(y) \right) \right\} \qquad (2.11)$$

This may be simplified to yield

$$n \int dx_1\, \Pr(x_1|y)\, (x_1, 0) = \frac{n+1}{2} \left( x_1'(y), x_2'(y) \right) + \frac{n-1}{2} \int dx_2\, \Pr(x_2) \sum_{y'=1}^{M} \Pr(y'|x_2) \left( x_1'(y'), x_2'(y') \right)$$
$$= \frac{n+1}{2} \left( x_1'(y), x_2'(y) \right) + \frac{n-1}{2} \left\langle \left( x_1'(y'), x_2'(y') \right) \right\rangle \qquad (2.12)$$

Here $\langle \cdots \rangle$ denotes the average of its argument over the nodes $y'$ that are attached to subspace 2. Write the 2 subspaces separately (remember that node $y$ is assumed to be attached to subspace 1)

$$x_1'(y) = \frac{2n}{n+1} \int dx_1\, \Pr(x_1|y)\, x_1 - \frac{n-1}{n+1} \left\langle x_1'(y') \right\rangle$$
$$x_2'(y) = -\frac{n-1}{n+1} \left\langle x_2'(y') \right\rangle \qquad (2.13)$$

If this result is simultaneously solved with the analogous result for node $y$ attached to subspace 2, then the $\langle \cdots \rangle$ terms vanish to yield

$$x_1'(y) = \begin{cases} \frac{2n}{n+1} \int dx_1\, \Pr(x_1|y)\, x_1 & y \text{ attached to subspace 1} \\ 0 & y \text{ attached to subspace 2} \end{cases}$$
$$x_2'(y) = \begin{cases} 0 & y \text{ attached to subspace 1} \\ \frac{2n}{n+1} \int dx_2\, \Pr(x_2|y)\, x_2 & y \text{ attached to subspace 2} \end{cases} \qquad (2.14)$$

Figure 5: 16 nodes are shown, 8 of which are attached to subspace 1 and detached from subspace 2 (top row), and 8 of which are attached to subspace 2 and detached from subspace 1 (bottom row).


E. Compare $D_1 + D_2$ for the 3 Different Types of Solution


Consider the left hand side of figure 3 for the case of $M$ nodes, when the $M$ threshold functions form a regular $M$-gon. $\Pr(x|y)$ then denotes the part of the circle that is associated with node $y$, whose radius of gyration squared is given by (assuming that the circle has unit radius)

$$R_M \equiv \left( \frac{M}{2\pi} \sin \frac{2\pi}{M} \right)^2 \qquad (2.15)$$

Gather the results for $\left( x_1'(y), x_2'(y) \right)$ in equations 2.7 (referred to as type 1), 2.9 (referred to as type 2), and 2.14 (referred to as type 3) together and insert them into $D_1 + D_2$ in equation 1.18 to obtain (see appendix E)

$$D_1 + D_2 = \begin{cases} \text{constant} - 2 R_M & \text{type 1} \\ \text{constant} - 4 R_{\sqrt{M}} & \text{type 2} \\ \text{constant} - \frac{4n}{n+1} R_{M/2} & \text{type 3} \end{cases} \qquad (2.16)$$

In figure 6 the 3 solutions are plotted for the case $n = 1$. For $n = 1$ the type 3 solution is never optimal, the type 1 solution is optimal for $M \leq 19$, and the type 2 solution is optimal for $M \geq 20$. This behaviour is intuitively sensible, because a larger number of nodes is required to cover a 2-torus as shown in figure 2(c) than as shown in figure 2(a) (or figure 2(b)).

In figure 7 the 3 solutions are plotted for the case $n = 2$. For $n = 2$ the type 1 solution is optimal for $M \leq 12$, and the type 2 solution is optimal for large $M \geq 30$, but there is now an intermediate region $12 \leq M \leq 29$ (type 1 and type 3 have an equal $D_1 + D_2$ at $M = 12$) where the $n$-dependence of the type 3 solution has now made it optimal. Again, this behaviour is intuitively reasonable, because the type 3 solution requires at least 2 observations in order to be able to yield a small Euclidean reconstruction error in each of the 2 subspaces, i.e. for $n = 2$ the 2 nodes that fire must be attached to different subspaces. Note that in the type 3 solution the nodes that fire are not guaranteed to be attached to different subspaces. In the type 3 solution there is a probability $\frac{1}{2^n} \frac{n!}{n_1!\, n_2!}$ that $n_i$ (where $n = n_1 + n_2$) nodes are attached to subspace $i$, so the trend is for the type 3 solution to become more favoured as $n$ is increased.

Figure 6: Plots of $-D_1 - D_2$ for $n = 1$ for each of the 3 types of optimum.


In figure 8 the 3 solutions are plotted for the case $n \to \infty$. For $n \to \infty$ the type 2 solution is never optimal, the type 1 solution is optimal for $M \leq 8$, and the type 3 solution is optimal for $M \geq 8$. The type 2 solution approaches the type 3 solution from below asymptotically as $M \to \infty$. In figure 9 a phase diagram is given which shows the relative stability of the 3 types of solution for different $M$ and $n$, where the type 3 solution is seen to be optimal over a large part of the $(M, n)$ plane. Thus the most interesting, and commonly occurring, solution is the one in which half the nodes are attached to one subspace and half to the other subspace (i.e. solution type 3). Although this result has been derived using the non-scaling version of the posterior probability model $\Pr(y|x)$ (as in equation 1.6), it may also be used for scaling posterior probabilities (as given in equation 1.8) in certain limiting cases, and also for cases where the effect of leakage $\Pr(y|y')$ is small.

Figure 7: Plots of $-D_1 - D_2$ for $n = 2$ for each of the 3 types of optimum.

Figure 8: Plots of $-D_1 - D_2$ for $n \to \infty$ for each of the 3 types of optimum.
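The comparison between the 3 solution types in equation 2.16 can be explored numerically. The sketch below assumes the closed form $R_M = \left( \frac{M}{2\pi} \sin \frac{2\pi}{M} \right)^2$ for the radius of gyration squared on a unit circle. This is an assumption, adopted because it reproduces the crossover values quoted in the text (for instance $M = 19$ versus $M = 20$ for $n = 1$). $M$ is treated as a continuous variable so that $R_{\sqrt{M}}$ and $R_{M/2}$ can be evaluated for any $M \geq 2$; the names `r_gyr`, `gains` and `best_type` are hypothetical.

```python
import math


def r_gyr(m):
    """Assumed closed form for R_M on a unit circle (see the lead-in)."""
    a = 2.0 * math.pi / m
    return (math.sin(a) / a) ** 2


def gains(m, n):
    """Return -(D1 + D2), up to the common constant, for solution types 1, 2, 3."""
    return {
        1: 2.0 * r_gyr(m),                          # all nodes attached to one subspace
        2: 4.0 * r_gyr(math.sqrt(m)),               # all nodes attached to both subspaces
        3: 4.0 * n / (n + 1.0) * r_gyr(m / 2.0),    # half the nodes attached to each subspace
    }


def best_type(m, n):
    """Solution type with the smallest D1 + D2 for M = m nodes and n firings."""
    g = gains(m, n)
    return max(g, key=g.get)


# Scan a few values of M for n = 2 (illustrative only).
scan_n2 = {m: best_type(m, 2) for m in (8, 20, 40)}
```

Scanning `best_type` over a grid of $(M, n)$ values gives a coarse version of a phase diagram of the kind described for figure 9, subject to the assumption made about $R_M$.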


F. Various Extensions 
1. The Effect of Leakage 


The effect of leakage will not be analysed in detail here. However, its effect may readily be discussed phenomenologically, because the optimisation acts to minimise the damaging effect of leakage on the posterior probability by ensuring that the properties of nodes that are connected by leakage are similar. This has the most dramatic effect on the type 3 solution, where the way in which the nodes are partitioned into 2 halves must be very carefully chosen in order to minimise the damage due to leakage. If the leakage is presumed to be a local function, so that $\Pr(y|y') = \pi(y - y')$, which is a localised “blob”-shaped function, then the properties of adjacent nodes are similar (after optimisation). Since nodes that are attached to 2 different subspaces necessarily have very different properties, whereas nodes that are attached to the same subspace can have similar properties, it follows that the nodes must split into 2 contiguous halves, where nodes $1, 2, \cdots, \frac{M}{2}$ are attached to subspace 1 and nodes $\frac{M}{2} + 1, \frac{M}{2} + 2, \cdots, M$ are attached to subspace 2, or vice versa. The effect of leakage is thereby minimised, with the worst effect occurring at the boundary between the 2 halves of nodes.

Figure 9: Phase diagram of the regions in which the 3 types of solution are optimal.


2. Modifying the Posterior Probability to Become Scalable 


The above analysis has focussed on the non-scaling version of the posterior probability, in which all $M$ nodes act together as a unit. The more general scaling case where the $M$ nodes are split up by the effect of the neighbourhood function $N(y)$ will not be analysed in detail, because many of its properties are essentially the same as in the non-scaling case. For simplicity assume that the neighbourhood function $N(y)$ is a “top-hat” with width $w$ (an odd integer) centred on $y$. Impose periodic boundary conditions so that the inverse neighbourhood function $\tilde{N}(y)$ is also a top-hat, $\tilde{N}(y) = N(y)$. In this case an optimum solution in the non-scaling case (with $M = w$) can be directly related to a corresponding optimum solution in the scaling case by simply repeating the node properties periodically every $w$ nodes. Strictly speaking, higher order periodicities can also occur in the scaling case (and can be favoured under certain conditions), where the period is $\frac{w}{k}$ ($k$ is an integer), but these will not be discussed here.

The effect of the periodic replication of node properties is interesting. The type 3 solution (with leakage and with $M = w$) splits the nodes into 2 halves, where nodes $1, 2, \cdots, \frac{w}{2}$ are attached to subspace 1 and nodes $\frac{w}{2} + 1, \frac{w}{2} + 2, \cdots, w$ are attached to subspace 2, or vice versa. When this is replicated periodically every $w$ nodes it produces an alternating structure of node properties, where $\frac{w}{2}$ nodes are attached to subspace 1, then the next $\frac{w}{2}$ nodes are attached to subspace 2, and then the next $\frac{w}{2}$ nodes are attached to subspace 1, and so on. This behaviour is reminiscent of the so-called “dominance stripes” that are observed in the mammalian visual cortex.


III. EXPERIMENT


The purpose of this section is to demonstrate the emergence of the dominance stripes in numerical simulations. The main body of the software is concerned with evaluating the derivatives of $D_1 + D_2$, and the main difficulty is choosing an appropriate form for the leakage (this has not yet been automated).


A. The Parameters 


The parameters that are required for a simulation are 
as follows: 


1. $(m_1, m_2)$: size of 2D rectangular array of nodes. $M = m_1 m_2$.


2. $(i_1, i_2)$: size of 2D rectangular input window for each node (odd integers). Ensure that the input window is not too many input data “correlation areas” in size, otherwise dominance stripes may not emerge. Dominance stripes require that the correlations within an input window are substantially stronger than the correlations between input windows that are attached to different subspaces.


3. $(w_1, w_2)$: size of 2D rectangular neighbourhood window for each node (odd integers). The neighbourhood function $N(y_1, y_2)$ is a rectangular top-hat centred on $(y_1, y_2)$. The size of the neighbourhood window has to lie within a limited range to ensure that dominance stripes are produced. This corresponds to ensuring that $M$ lies in the type 3 region of the phase diagram in figure 9. It is also preferable for the size of the neighbourhood window to be substantially smaller than the input window, otherwise different parts of a neighbourhood window will see different parts of the input data, which will make the network behaviour more difficult to interpret.


4. $(l_1, l_2)$: size of 2D rectangular leakage window for each node (odd integers). For simplicity the leakage $\Pr(y|y')$ is assumed to be given by $\Pr(y|y') = \pi(y - y')$, where $\pi(y - y')$ is a “top-hat” function of $y - y'$ which covers a rectangular region of size $(l_1, l_2)$ centred on $y - y' = 0$. The size of the leakage window must be large enough to correlate the parameters of adjacent nodes, but not so large that it enforces such strong correlations between the node parameters that it destroys dominance stripes.


5. $\nu$: additive noise level used to corrupt each member of the training set.


6. $\kappa$: wavenumber of sinusoids used in the training set. In describing the training sets the index $y$ will be used to denote position in input space, thus position $y$ in input space lies directly “under” node $y$ of the network. In 1D simulations each training vector is a sinusoid of the form $\sin(\kappa y + \phi) + r$, where $\phi$ is a random phase angle, and $r$ is a random number sampled uniformly from the interval $\left[ -\frac{\nu}{2}, \frac{\nu}{2} \right]$; this generates an $S^1$ topology training set (i.e. parameterised by 1 random angle). In 2D simulations each training vector is a sinusoid of the form $\sin\left( \kappa (y_1 \cos\theta + y_2 \sin\theta) + \phi \right) + r$, where the additional angle $\theta$ is a random azimuthal orientation for the sine wave; this generates an $S^1 \times S^1$ topology training set (i.e. parameterised by 2 independent random angles). Note that $\kappa i_1$ and $\kappa i_2$ must each be an integer multiple of $2\pi$ in order to ensure that the probability density around the $S^1$ subspace generated by $\phi$ is uniform (in effect, the $S^1$ then becomes a circular Lissajous figure, which therefore has uniform probability density, unlike non-circular Lissajous figures), and thus to ensure that there are no artefacts induced by the periodicity of the training data that might mimic the effect of dominance stripes. If $\kappa i_1$ and $\kappa i_2$ are much greater than $2\pi$ then it is not necessary to fix them to be integer multiples of $2\pi$, because the fluctuations in the probability density are then negligible. Note that this restriction on the value of $\kappa$ would not have been necessary had complex exponentials been used rather than sinusoids.


7. $s$: number of subspaces. This fixes the number of statistically independent subspaces in the training set. When $s = 1$ the training set is generated exactly as above. When $s = 2$ the training set is split up as follows. The 1D case has even $y$ in one subspace, and odd $y$ in the other subspace, thus successive components of each training vector alternate between the 2 subspaces. The 2D case has even $y_1 + y_2$ in one subspace, and odd $y_1 + y_2$ in the other subspace, thus each training vector is split up into a chessboard pattern of interlocking subspaces. This strategy readily generalises for $s \geq 3$, although this is not used here. Within each subspace the training vector is generated as above, and the subspaces are generated so that they are statistically independent.


8. $\epsilon$: update parameter used in gradient descent. This is used to update parameters thus

$$\text{parameter} \longrightarrow \text{parameter} - \epsilon\, \frac{\partial (D_1 + D_2)}{\partial\, \text{parameter}} \qquad (3.1)$$

There are 3 internally generated update parameters, which control the update of the 3 different types of parameter, i.e. the biases, the weights, and the reference vectors. This is necessary because these parameters all have different dimensionalities, and by inspection of equation 3.1 the dimensionality of an update parameter is the dimensionality of the parameter it updates (squared) divided by the dimensionality of the Euclidean distortion. These 3 internal parameters are automatically adjusted to ensure that the average change in absolute value of each of the 3 types of parameter is equal to $\epsilon$ times the typical diameter of the region of parameter space populated by the parameters. This adjustment is made anew as each training vector is presented. The size of $\epsilon$ determines the “memory time” of the node parameters. This memory time determines the effective number of training vectors that the nodes are being optimised against, and thus must be sufficiently long (i.e. $\epsilon$ sufficiently small) that if $s \geq 2$ it is possible to discern that the subspaces are indeed statistically independent. This is crucially important, for dominance stripes cannot be obtained if the subspaces are not sufficiently statistically independent. So $\epsilon$ must be small, which unfortunately leads to correspondingly long training times.
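The 1D training vectors described in the parameter list can be sketched as follows. Several details are assumptions made for this example rather than statements about the original software: positions are zero-based, the sinusoid in each subspace is evaluated at the global position $y$, the noise is drawn uniformly from $[-\nu/2, \nu/2]$, and the function name `training_vector_1d` is hypothetical.

```python
import math
import random


def training_vector_1d(length, kappa, nu, s, rng):
    """Generate one 1D training vector with `length` components.

    For s = 1 every component y follows sin(kappa * y + phi) + r.
    For s = 2 even and odd y belong to two subspaces, each with its own
    independently drawn phase, so the subspaces are statistically independent.
    """
    # One independent random phase per subspace.
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(s)]
    vec = []
    for y in range(length):
        phi = phases[y % s]                       # subspace index alternates with y
        r = rng.uniform(-nu / 2.0, nu / 2.0)      # additive noise (assumed interval)
        vec.append(math.sin(kappa * y + phi) + r)
    return vec


rng = random.Random(0)
vector = training_vector_1d(length=41, kappa=0.3, nu=0.1, s=2, rng=rng)
```

Reusing the same seed reproduces the same sequence of vectors, which corresponds to a finite-sized training set, whereas drawing from a continuing random stream approximates an infinite-sized one.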


B. Initialisation 


The training set is globally translated and scaled so that the components of all of its training vectors lie in the interval $[-1, +1]$. There are 3 parameter types to initialise. The weights were all initialised to random numbers sampled from a uniform distribution in the interval $[-0.1, +0.1]$, whereas the biases and the reference vector components were all initialised to 0. Because the 2D simulations took a very long time to run, they were periodically interrupted and the state of all the variables written to an output file. The simulation could then be continued by reading this output file in again and simply continuing where the simulation left off. Alternatively, some of the variables might have their values changed before continuing. In particular, the random number generator could thus be manipulated to simulate the effect of a finite-sized training set (i.e. use the same random number seed at the start of each part of the simulation), or an infinite-sized training set (i.e. use a different random number seed at the start of each part of the simulation). The size of the $\epsilon$ parameter could also thus be manipulated should a large value be required initially, and reduced to a small value later on, as required in order to guarantee that when $n \geq 2$ the input subspaces are seen to be statistically independent, and dominance stripes may emerge.
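The rescaling and initialisation steps can be sketched as follows. The function names `rescale_training_set` and `init_node` are hypothetical, and the sketch assumes that the training set is not constant (otherwise the scale factor would be zero).

```python
import random


def rescale_training_set(vectors):
    """Globally translate and scale so that every component lies in [-1, +1].

    A single offset and a single scale factor are used for the whole training
    set, so the relative sizes of the components are preserved.
    """
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    mid = (hi + lo) / 2.0
    half = (hi - lo) / 2.0        # assumed non-zero, i.e. the data are not constant
    return [[(c - mid) / half for c in v] for v in vectors]


def init_node(window, rng):
    """Initial parameters for one node whose input window has `window` components."""
    weights = [rng.uniform(-0.1, 0.1) for _ in range(window)]   # small random weights
    bias = 0.0                                                  # biases start at 0
    reference = [0.0] * window                                  # reference vector starts at 0
    return weights, bias, reference


scaled = rescale_training_set([[0.0, 2.0], [4.0, 1.0]])
weights, bias, reference = init_node(41, random.Random(0))
```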


C. Boundary Conditions 


There are many ways to choose the boundary conditions. In the numerical simulations periodic boundary conditions will be avoided, because they can lead to artefacts in which the node parameters become topologically trapped. For instance, in a 2D simulation, periodic boundary conditions imply that the nodes sit on a 2-torus. Leakage implies that the node parameter values are similar for adjacent nodes, which limits the freedom for the parameters to adjust their values on the surface of the 2-torus. For instance, any acceptable set of parameters that sits on the 2-torus can be converted into another acceptable set by mapping the 2-torus to itself, so that each of its $S^1$ “coils up” an integer number of times onto itself. Such a multiply wrapped parameter configuration is topologically trapped, and cannot be perturbed to its original form. This problem does not arise with non-periodic boundary conditions.

There are several different problems that arise at the 
boundaries of the array of nodes: 


1. The neighbourhood function $N(y_1, y_2)$ cannot be assumed to be a rectangular top-hat centred on $(y_1, y_2)$. Instead, it will simply be truncated so that it does not fall off the edge of the array of nodes, i.e. $N(y_1, y_2) = 0$ for those $(y_1, y_2)$ that lie outside the array.


2. The leakage function $\pi(y_1 - y_1', y_2 - y_2')$ will be similarly truncated. However, in this case $\pi(y_1 - y_1', y_2 - y_2')$ must normalise to unity when summed over $(y_1, y_2)$, so the effect of the truncation must be compensated by scaling the remaining elements of $\pi(y_1 - y_1', y_2 - y_2')$.


3. The input window for each node implies that the 
input array must be larger than the node array in 
order that the input windows never fall off the edge 
of the input array. 
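The three boundary treatments listed above can be sketched in 1D as follows. The zero-based node indices and the function names `truncated_neighbourhood`, `truncated_leakage_column` and `input_array_size` are assumptions made for this example; the 2D case would apply the same clipping along each axis.

```python
def truncated_neighbourhood(y, width, m):
    """Nodes in the top-hat neighbourhood of node y, clipped to the array 0..m-1."""
    half = width // 2
    return [k for k in range(y - half, y + half + 1) if 0 <= k < m]


def truncated_leakage_column(y_src, width, m):
    """Leakage from node y_src onto each target node y, for a top-hat of `width`.

    The top-hat is clipped at the array edges and the surviving elements are
    rescaled so that they still sum to unity over the target nodes y.
    """
    targets = truncated_neighbourhood(y_src, width, m)
    return {y: 1.0 / len(targets) for y in targets}


def input_array_size(m, window):
    """Smallest 1D input array for which no input window falls off the edge."""
    return m + window - 1


edge = truncated_leakage_column(0, 5, 10)       # clipped at the left edge
interior = truncated_leakage_column(5, 5, 10)   # full-width top-hat
```

An edge node therefore leaks onto fewer neighbours, each with a larger share, than an interior node does.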


D. Presentation of Results 


The most important result is the emergence of dominance stripes. For n = 2 there are thus 2 numbers that need to be displayed for each node: the “degree of attachment” to subspace 1, and similarly for subspace 2. There are many ways to measure degree of attachment, for instance the probability density $\Pr(x|y)$ gives a direct measurement of how strongly node $y$ depends on the input vector $x$, so its “width” or “volume” in each of the subspaces could be used to measure degree of attachment. However, in the simulations presented here (i.e. sinusoidal training vectors) the degree of attachment is measured as the average of the absolute values of the components of the reference vector in the subspace concerned. This measure tends to zero for complete detachment. For 1D simulations 2 dominance plots can be overlaid to show the dominance of subspaces 1 and 2 for each node. For 2D simulations it is simplest to present only 1 of these plots as a 2D array of grey-scale pixels, where the grey level indicates the dominance of subspace 1 (or, alternatively, subspace 2).¹

Figure 10: Dominance plots for a 1D simulation with 2 statistically independent training set subspaces.

¹ The results of the 2D simulations do not appear in this draft paper.
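The degree-of-attachment measure used for the 1D dominance plots can be sketched as below, assuming the 2 subspaces occupy the even and odd input positions respectively. The function name `degree_of_attachment` and the `offset` argument are hypothetical, and the window is assumed to contain at least one position of each parity.

```python
def degree_of_attachment(reference, offset=0):
    """Average absolute reference-vector component in each of 2 interleaved subspaces.

    reference : reference vector components inside the node's input window
    offset    : input-space position of the first component (fixes even/odd parity)
    Returns (attachment to the even-position subspace,
             attachment to the odd-position subspace).
    """
    even = [abs(c) for k, c in enumerate(reference) if (k + offset) % 2 == 0]
    odd = [abs(c) for k, c in enumerate(reference) if (k + offset) % 2 == 1]
    return sum(even) / len(even), sum(odd) / len(odd)


# A node whose reference vector is non-zero only at even positions.
attach_even, attach_odd = degree_of_attachment([0.5, 0.0, -0.7, 0.0, 0.6])
```

Plotting the two returned values against node position gives the pair of overlaid dominance curves described above.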


E. 1D Simulation 


The parameter values used were: $(m_1, m_2) = (1, 100)$, $(i_1, i_2) = (1, 41)$, $(w_1, w_2) = (1, 21)$, $(l_1, l_2) = (1, 15)$, $\kappa = 0.3$, $\nu = 0.1$, $s = 2$, $n = 400$, $\epsilon = 0.002$. This value of $\kappa$ implies $\kappa i_2 / 2\pi \approx 1.96$, so $\kappa i_2$ is approximately an integer multiple of $2\pi$, as required for an artefact-free simulation. In figure 10 a plot of the 2 dominance curves obtained after 3200 training updates is shown. This dominance plot clearly shows alternating regions where subspace 1 dominates and subspace 2 dominates. The width of the neighbourhood function is 21, which is the same as the period of the variations in the dominance plots, i.e. within each set of adjacent 21 nodes half the nodes are attached to subspace 1 and half to subspace 2. There are boundary effects, but these are unimportant.


Appendix A: Normalisation of Pr (y|x) 


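The normalisation of the expression for $\Pr(y|x)$ in equation 1.8 may be demonstrated as follows:

$$\begin{aligned}
\sum_{y=1}^{M} \Pr(y|x) &= \frac{1}{M} \sum_{y=1}^{M} \sum_{y' \in \tilde{N}(y)} \frac{Q(x|y)}{\sum_{y'' \in N(y')} Q(x|y'')} \\
&= \frac{1}{M} \sum_{y'=1}^{M} \sum_{y \in N(y')} \frac{Q(x|y)}{\sum_{y'' \in N(y')} Q(x|y'')} \\
&= \frac{1}{M} \sum_{y'=1}^{M} 1 \\
&= \frac{m_1 m_2 \cdots m_d}{M} \\
&= 1 \qquad (A1)
\end{aligned}$$

In the first step the order of the $y$ and the $y'$ summations is interchanged using $\sum_{y=1}^{M} \sum_{y' \in \tilde{N}(y)} (\cdots) = \sum_{y'=1}^{M} \sum_{y \in N(y')} (\cdots)$, in the second step the numerator and denominator of the summand cancel out.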
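The normalisation argument can be checked numerically. The sketch below assumes the scalable posterior has the form that appears in the first line of the derivation, uses 1D top-hat neighbourhoods truncated at the array ends, and requires every $Q(x|y)$ to be strictly positive; the names `neighbourhood` and `scalable_posterior` and the sample values of $Q$ are hypothetical.

```python
def neighbourhood(yp, width, m):
    """Top-hat neighbourhood N(y') of the given width, truncated to nodes 0..m-1."""
    half = width // 2
    return [y for y in range(yp - half, yp + half + 1) if 0 <= y < m]


def scalable_posterior(q, width):
    """Pr(y|x) = (1/M) sum_{y' : y in N(y')} Q(x|y) / sum_{y'' in N(y')} Q(x|y'').

    q : list of Q(x|y) values for one input x, one entry per node (all > 0)
    """
    m = len(q)
    post = [0.0] * m
    for yp in range(m):
        nbrs = neighbourhood(yp, width, m)
        total = sum(q[y2] for y2 in nbrs)
        # Looping over y in N(y') visits exactly the pairs with y' in the
        # inverse neighbourhood of y, which is the interchange of summations.
        for y in nbrs:
            post[y] += q[y] / (total * m)
    return post


q = [0.2, 0.9, 0.5, 0.1, 0.7, 0.3]      # illustrative Q(x|y) values
post = scalable_posterior(q, width=3)
```

The sum of `post` equals 1 for any choice of neighbourhoods, because each neighbourhood contributes a locally normalised share of $1/M$.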


Appendix B: Upper Bound for Multiple Firing Model 


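It is possible to simplify equation 1.10 by using the following identity

$$x - x'(y_1, y_2, \cdots, y_n) \equiv \frac{1}{n} \sum_{i=1}^{n} \left( x - x'(y_i) \right) + \left( \frac{1}{n} \sum_{i=1}^{n} x'(y_i) - x'(y_1, y_2, \cdots, y_n) \right) \qquad (B1)$$

Note that this holds for all choices of $x'(y_i)$. This allows the Euclidean distance to be expanded thus

$$\begin{aligned}
\left\| x - x'(y_1, y_2, \cdots, y_n) \right\|^2 &= \frac{1}{n^2} \sum_{i=1}^{n} \left\| x - x'(y_i) \right\|^2 + \frac{1}{n^2} \sum_{\substack{i,j=1 \\ i \neq j}}^{n} \left( x - x'(y_i) \right) \cdot \left( x - x'(y_j) \right) \\
&\quad + \frac{2}{n} \sum_{i=1}^{n} \left( x - x'(y_i) \right) \cdot \left( \frac{1}{n} \sum_{j=1}^{n} x'(y_j) - x'(y_1, y_2, \cdots, y_n) \right) \\
&\quad + \left\| \frac{1}{n} \sum_{i=1}^{n} x'(y_i) - x'(y_1, y_2, \cdots, y_n) \right\|^2 \qquad (B2)
\end{aligned}$$

Each term of this expansion can be inserted into equation 1.10 to yield

$$\begin{aligned}
\text{term 1} &= \frac{1}{n} \int dx\, \Pr(x) \sum_{y=1}^{M} \Pr(y|x) \left\| x - x'(y) \right\|^2 \\
\text{term 2} &= \frac{n-1}{n} \int dx\, \Pr(x) \sum_{y_1, y_2 = 1}^{M} \Pr(y_1, y_2|x) \left( x - x'(y_1) \right) \cdot \left( x - x'(y_2) \right) \\
\text{term 3} &= -2 \times \text{term 4} \\
\text{term 4} &= \sum_{y_1, y_2, \cdots, y_n = 1}^{M} \Pr(y_1, y_2, \cdots, y_n) \left\| x'(y_1, y_2, \cdots, y_n) - \frac{1}{n} \sum_{i=1}^{n} x'(y_i) \right\|^2 \qquad (B3)
\end{aligned}$$

$\Pr(y_1, y_2, \cdots, y_n|x)$ has been assumed to be a symmetric function of $(y_1, y_2, \cdots, y_n)$ in the first two results, and the definition of $x'(y_1, y_2, \cdots, y_n)$ in equation 1.11 has been used to obtain the third result. These results allow $D$ in equation 1.10 to be expanded as $D = D_1 + D_2 - D_3$, where

$$\begin{aligned}
D_1 &= \frac{2}{n} \sum_{y=1}^{M} \int dx\, \Pr(x)\, \Pr(y|x) \left\| x - x'(y) \right\|^2 \\
D_2 &= \frac{2(n-1)}{n} \int dx\, \Pr(x) \sum_{y_1, y_2 = 1}^{M} \Pr(y_1, y_2|x) \left( x - x'(y_1) \right) \cdot \left( x - x'(y_2) \right) \\
D_3 &= 2 \sum_{y_1, y_2, \cdots, y_n = 1}^{M} \Pr(y_1, y_2, \cdots, y_n) \left\| x'(y_1, y_2, \cdots, y_n) - \frac{1}{n} \sum_{i=1}^{n} x'(y_i) \right\|^2 \qquad (B4)
\end{aligned}$$

By noting that $D_3 \geq 0$, an upper bound for $D$ in the form $D \leq D_1 + D_2$ follows immediately from these results. Note that $D_1 \geq 0$ whereas $D_2$ can have either sign. In the special case where $\Pr(y_1, y_2|x) = \Pr(y_1|x)\, \Pr(y_2|x)$ (i.e. $y_1$ and $y_2$ are independent of each other given that $x$ is known) $D_2$ reduces to

$$D_2 = \frac{2(n-1)}{n} \int dx\, \Pr(x) \left\| \sum_{y=1}^{M} \Pr(y|x) \left( x - x'(y) \right) \right\|^2 \qquad (B5)$$

which is manifestly positive. This is the form of $D_2$ that is used throughout this paper.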


Appendix C: Derivatives of the Objective Function 


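$D_1$ and $D_2$ are as given in equation 1.12, i.e. it is assumed that Pr (y1, y2|x) = Pr (y1|x) Pr (y2|x) and Pr (y|x)
has the scalable form given in equation 1.8. Define a compact matrix notation as follows

$$
\begin{aligned}
L_{y,y'} &\equiv \Pr(y'|y) & P_{y,y'} &\equiv \Pr(y'|\mathbf{x};y) \\
p_y &\equiv \sum_{y'\in N^{-1}(y)} P_{y',y} & (L^T p)_y &\equiv \sum_{y'=1}^{m} L_{y',y}\,p_{y'} \\
\mathbf{d}_y &\equiv \mathbf{x}-\mathbf{x}'(y) & (L\mathbf{d})_y &\equiv \sum_{y'=1}^{m} L_{y,y'}\,\mathbf{d}_{y'} \\
(PL\mathbf{d})_y &\equiv \sum_{y'\in N(y)} P_{y,y'}\,(L\mathbf{d})_{y'} & (P^T PL\mathbf{d})_y &\equiv \sum_{y'\in N^{-1}(y)} P_{y',y}\,(PL\mathbf{d})_{y'} \\
e_y &\equiv \left\|\mathbf{x}-\mathbf{x}'(y)\right\|^2 & (Le)_y &\equiv \sum_{y'=1}^{m} L_{y,y'}\,e_{y'} \\
(PLe)_y &\equiv \sum_{y'\in N(y)} P_{y,y'}\,(Le)_{y'} & (P^T PLe)_y &\equiv \sum_{y'\in N^{-1}(y)} P_{y',y}\,(PLe)_{y'} \\
\bar{\mathbf{d}} &\equiv \sum_{y=1}^{m} (L^T p)_y\,\mathbf{d}_y & \bar{\mathbf{d}} &= \sum_{y=1}^{m} (PL\mathbf{d})_y
\end{aligned}
\tag{C1}
$$

Using this matrix notation, the functions $f_1(\mathbf{x},y)$, $f_2(\mathbf{x},y)$, $g_1(\mathbf{x},y)$, and $g_2(\mathbf{x},y)$ may be defined as

$$
\begin{aligned}
f_1(\mathbf{x},y) &\equiv (L^T p)_y\,\mathbf{d}_y && \text{(C2)} \\
f_2(\mathbf{x},y) &\equiv (L^T p)_y\,\bar{\mathbf{d}} && \text{(C3)} \\
g_1(\mathbf{x},y) &\equiv p_y\,(Le)_y - (P^T PLe)_y && \text{(C4)} \\
g_2(\mathbf{x},y) &\equiv \left(p_y\,(L\mathbf{d})_y - (P^T PL\mathbf{d})_y\right)\cdot\bar{\mathbf{d}} && \text{(C5)}
\end{aligned}
$$

The variation of Pr (y|x; y') is then given by

$$
\begin{aligned}
\delta\Pr(y|\mathbf{x};y') &= \Pr(y|\mathbf{x};y')\left(\delta\log Q(\mathbf{x}|y) - \sum_{y''\in N(y')}\Pr(y''|\mathbf{x};y')\,\delta\log Q(\mathbf{x}|y'')\right) \\
&= \Pr(y|\mathbf{x};y')\sum_{y''\in N(y')}\delta\log Q(\mathbf{x}|y'')\left(\delta_{y,y''}-\Pr(y''|\mathbf{x};y')\right) \\
&= P_{y',y}\sum_{y''\in N(y')}\delta\log Q(\mathbf{x}|y'')\left(\delta_{y,y''}-P_{y',y''}\right)
\end{aligned}
\tag{C6}
$$

In order to rearrange the expressions to ensure that only a single dummy index is required at every stage of evaluation
of the sums it will be necessary to use the result

$$
\sum_{y=1}^{m}\,\sum_{y'\in N^{-1}(y)}(\cdots) = \sum_{y'=1}^{m}\,\sum_{y\in N(y')}(\cdots)
\tag{C7}
$$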


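1. Calculate $\partial D_1/\partial\mathbf{x}'(y)$


The derivative is given by

$$
\frac{\partial D_1}{\partial\mathbf{x}'(y)} = -\frac{4}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y'=1}^{m}\Pr(y|y')\sum_{y''\in N^{-1}(y')}\Pr(y'|\mathbf{x};y'')\,\left(\mathbf{x}-\mathbf{x}'(y)\right)
\tag{C8}
$$

Use matrix notation to write this as

$$
\frac{\partial D_1}{\partial\mathbf{x}'(y)} = -\frac{4}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y'=1}^{m}L_{y',y}\sum_{y''\in N^{-1}(y')}P_{y'',y'}\,\mathbf{d}_y
\tag{C9}
$$

Finally remove the explicit summations to obtain the required result

$$
\begin{aligned}
\frac{\partial D_1}{\partial\mathbf{x}'(y)} &= -\frac{4}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\,(L^T p)_y\,\mathbf{d}_y \\
&= -\frac{4}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\,f_1(\mathbf{x},y)
\end{aligned}
\tag{C10}
$$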


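2. Calculate $\partial D_2/\partial\mathbf{x}'(y)$


The derivative is given by

$$
\begin{aligned}
\frac{\partial D_2}{\partial\mathbf{x}'(y)} = -\frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})&\left(\sum_{y'=1}^{m}\Pr(y|y')\sum_{y''\in N^{-1}(y')}\Pr(y'|\mathbf{x};y'')\right) \\
\times&\left(\sum_{\hat{y}=1}^{m}\sum_{y'=1}^{m}\Pr(\hat{y}|y')\sum_{y''\in N^{-1}(y')}\Pr(y'|\mathbf{x};y'')\,\left(\mathbf{x}-\mathbf{x}'(\hat{y})\right)\right)
\end{aligned}
\tag{C11}
$$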


A Self-Organising Neural Network for Processing Data from Multiple Sensors 


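Use matrix notation to write this as

$$
\frac{\partial D_2}{\partial\mathbf{x}'(y)} = -\frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})\left(\sum_{y'=1}^{m}L_{y',y}\sum_{y''\in N^{-1}(y')}P_{y'',y'}\right)\left(\sum_{\hat{y}=1}^{m}\sum_{y'=1}^{m}L_{y',\hat{y}}\sum_{y''\in N^{-1}(y')}P_{y'',y'}\,\mathbf{d}_{\hat{y}}\right)
\tag{C12}
$$

Finally remove the explicit summations to obtain the required result

$$
\begin{aligned}
\frac{\partial D_2}{\partial\mathbf{x}'(y)} &= -\frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})\,(L^T p)_y\,\bar{\mathbf{d}} \\
&= -\frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})\,f_2(\mathbf{x},y)
\end{aligned}
\tag{C13}
$$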


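3. Calculate $\delta D_1/\delta\log Q(\mathbf{x}|y)$


The differential is given by

$$
\delta D_1 = \frac{2}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{m}\sum_{y'=1}^{m}\Pr(y|y')\sum_{y''\in N^{-1}(y')}\delta\Pr(y'|\mathbf{x};y'')\,\left\|\mathbf{x}-\mathbf{x}'(y)\right\|^2
\tag{C14}
$$

Use matrix notation to write this as

$$
\delta D_1 = \frac{2}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{m}\sum_{y'=1}^{m}L_{y',y}\sum_{y''\in N^{-1}(y')}P_{y'',y'}\sum_{y'''\in N(y'')}\delta\log Q(\mathbf{x}|y''')\left(\delta_{y',y'''}-P_{y'',y'''}\right)e_y
\tag{C15}
$$

Reorder the summations to obtain

$$
\begin{aligned}
\delta D_1 = \frac{2}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y'''=1}^{m}\delta\log Q(\mathbf{x}|y''')\sum_{y''\in N^{-1}(y''')}&\left(\sum_{y'\in N(y'')}\delta_{y',y'''}\,P_{y'',y'}\sum_{y=1}^{m}L_{y',y}\,e_y\right. \\
&\left.\; - P_{y'',y'''}\sum_{y'\in N(y'')}P_{y'',y'}\sum_{y=1}^{m}L_{y',y}\,e_y\right)
\end{aligned}
\tag{C16}
$$

Relabel the indices and evaluate the sum over the Kronecker delta to obtain

$$
\begin{aligned}
\delta D_1 = \frac{2}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{m}\delta\log Q(\mathbf{x}|y)&\left(\left(\sum_{y'\in N^{-1}(y)}P_{y',y}\right)\left(\sum_{y'=1}^{m}L_{y,y'}\,e_{y'}\right)\right. \\
&\left.\; - \sum_{y''\in N^{-1}(y)}P_{y'',y}\sum_{y'''\in N(y'')}P_{y'',y'''}\sum_{y'=1}^{m}L_{y''',y'}\,e_{y'}\right)
\end{aligned}
\tag{C17}
$$

Finally remove the explicit summations to obtain the required result

$$
\begin{aligned}
\delta D_1 &= \frac{2}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{m}\delta\log Q(\mathbf{x}|y)\left(p_y\,(Le)_y-(P^T PLe)_y\right) \\
&= \frac{2}{nM}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{m}g_1(\mathbf{x},y)\,\delta\log Q(\mathbf{x}|y)
\end{aligned}
\tag{C18}
$$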
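4. Calculate $\delta D_2/\delta\log Q(\mathbf{x}|y)$


The differential is given by

$$
\begin{aligned}
\delta D_2 = \frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})&\left(\sum_{y=1}^{m}\sum_{y'=1}^{m}\Pr(y|y')\sum_{y''\in N^{-1}(y')}\Pr(y'|\mathbf{x};y'')\,\left(\mathbf{x}-\mathbf{x}'(y)\right)\right) \\
\cdot&\left(\sum_{y=1}^{m}\sum_{y'=1}^{m}\Pr(y|y')\sum_{y''\in N^{-1}(y')}\delta\Pr(y'|\mathbf{x};y'')\,\left(\mathbf{x}-\mathbf{x}'(y)\right)\right)
\end{aligned}
\tag{C19}
$$

Use matrix notation to write this as

$$
\begin{aligned}
\delta D_2 = \frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})&\left(\sum_{y=1}^{m}\sum_{y'=1}^{m}L_{y',y}\sum_{y''\in N^{-1}(y')}P_{y'',y'}\,\mathbf{d}_y\right) \\
\cdot&\left(\sum_{y=1}^{m}\sum_{y'=1}^{m}L_{y',y}\sum_{y''\in N^{-1}(y')}P_{y'',y'}\sum_{y'''\in N(y'')}\delta\log Q(\mathbf{x}|y''')\left(\delta_{y',y'''}-P_{y'',y'''}\right)\mathbf{d}_y\right)
\end{aligned}
\tag{C20}
$$

Reorder the summations to obtain

$$
\begin{aligned}
\delta D_2 = \frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})&\left(\sum_{y=1}^{m}\sum_{y'=1}^{m}L_{y',y}\sum_{y''\in N^{-1}(y')}P_{y'',y'}\,\mathbf{d}_y\right) \\
\cdot\sum_{y'''=1}^{m}\delta\log Q(\mathbf{x}|y''')\sum_{y''\in N^{-1}(y''')}&\left(\sum_{y'\in N(y'')}\delta_{y',y'''}\,P_{y'',y'}\sum_{y=1}^{m}L_{y',y}\,\mathbf{d}_y - P_{y'',y'''}\sum_{y'\in N(y'')}P_{y'',y'}\sum_{y=1}^{m}L_{y',y}\,\mathbf{d}_y\right)
\end{aligned}
\tag{C21}
$$

Relabel the indices and evaluate the sum over the Kronecker delta to obtain

$$
\begin{aligned}
\delta D_2 = \frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{m}\delta\log Q(\mathbf{x}|y)&\left(\left(\sum_{y'\in N^{-1}(y)}P_{y',y}\right)\left(\sum_{y'=1}^{m}L_{y,y'}\,\mathbf{d}_{y'}\right)\right. \\
&\left.\; - \sum_{y''\in N^{-1}(y)}P_{y'',y}\sum_{y'''\in N(y'')}P_{y'',y'''}\sum_{y'=1}^{m}L_{y''',y'}\,\mathbf{d}_{y'}\right) \\
\cdot&\left(\sum_{\hat{y}=1}^{m}\sum_{y'=1}^{m}L_{y',\hat{y}}\sum_{y''\in N^{-1}(y')}P_{y'',y'}\,\mathbf{d}_{\hat{y}}\right)
\end{aligned}
\tag{C22}
$$

Finally remove the explicit summations to obtain the required result

$$
\begin{aligned}
\delta D_2 &= \frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{m}\delta\log Q(\mathbf{x}|y)\left(p_y\,(L\mathbf{d})_y-(P^T PL\mathbf{d})_y\right)\cdot\bar{\mathbf{d}} \\
&= \frac{4(n-1)}{nM^2}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{m}g_2(\mathbf{x},y)\,\delta\log Q(\mathbf{x}|y)
\end{aligned}
\tag{C23}
$$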
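Appendix D: Expression for $D_1 + D_2$ in Terms of x' (y)


From equation 1.12 $D_1 + D_2$ can be written as

$$
\begin{aligned}
D_1 + D_2 &= \frac{2}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{m}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)\cdot\left(\mathbf{x}'(y)-2\mathbf{x}\right) \\
&\quad + \frac{2(n-1)}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\left(\sum_{y=1}^{m}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)\right)\cdot\left(\sum_{y=1}^{m}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)-2\mathbf{x}\right) \\
&\quad + \text{constant}
\end{aligned}
\tag{D1}
$$

where the constant terms do not depend on x' (y). However, from equation 1.12 the derivative $\partial(D_1+D_2)/\partial\mathbf{x}'(y)$ can be
written as

$$
\frac{\partial(D_1+D_2)}{\partial\mathbf{x}'(y)} = -\frac{4}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\,\Pr(y|\mathbf{x})\left(\mathbf{x}-\mathbf{x}'(y) + (n-1)\sum_{y'=1}^{m}\Pr(y'|\mathbf{x})\,\left(\mathbf{x}-\mathbf{x}'(y')\right)\right)
\tag{D2}
$$

Using Bayes' theorem the stationarity condition $\partial(D_1+D_2)/\partial\mathbf{x}'(y) = 0$ yields a matrix equation for the x' (y)

$$
n\int d\mathbf{x}\,\Pr(\mathbf{x}|y)\,\mathbf{x} = (n-1)\sum_{y'=1}^{m}\left(\int d\mathbf{x}\,\Pr(\mathbf{x}|y)\,\Pr(y'|\mathbf{x})\right)\mathbf{x}'(y') + \mathbf{x}'(y)
\tag{D3}
$$

which may then be used to replace all instances of x in equation D1. This yields the result

$$
D_1 + D_2 = -\frac{2}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{m}\Pr(y|\mathbf{x})\,\left\|\mathbf{x}'(y)\right\|^2 - \frac{2(n-1)}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\,\left\|\sum_{y=1}^{m}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)\right\|^2 + \text{constant}
\tag{D4}
$$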


Appendix E: Comparison of D; + D2 for Different Types of Optima 


In order to compare the value of $D_1 + D_2$ that is obtained when different types of supposedly optimum configurations
of the threshold functions Q (x|y) are tried, the x' (y) that solves $\partial(D_1+D_2)/\partial\mathbf{x}'(y) = 0$ (see appendix D) must be inserted
into the expression for $D_1 + D_2$. In the following derivations the constant term is omitted, and the definition


Ru = || f dx Pr (xy) x||’ (see equation 2.15) has been used. 


1. Type 1 Optimum: all the nodes are attached to one subspace


In equation 1.18 D; + D2 becomes 


Di+D, = -= fax Pr(3) > Pru) \(/ x Proaly) 0) i 
=20D fix Pr(x) 3 Pele ( x: Pr Gly) x1.0) 
7 2 alu Dg - 
AD By (E1) 
2. Type 2 Optimum: all the nodes are attached to both subspaces


In equation 1.18 D; + D2 becomes 


2 
D,+ D2 


I 

| 

| 

nee 

a 

* 
ty 

a: 
# 
y 

a 
= 
a 


(/ dx, Pr (x;|y) x, f dxs Pr (x2|y) x2) 
2022) fax Pr(x) So Pru ( [es Pray) x1 fase Pre) *) 


(? , (n=) oR 


| 2 
n 


AR aq (E2) 


3. Type 3 Optimum: half the nodes are attached to one subspace and half are attached to the other


In equation 1.18 D; + D2 becomes 


A a ae 
D,+D2z = sa cere dx Pr (x) 
a 2 M 2 
« {oPrtuls) | (fae Proaly) 1.0) + 2 Prix) [(0, f axe Proalu) x) 
y=l y=Y41 


a M = 
<|[SoProbs f dx PeGaly) 2, S Pr(vls) fds Pr Gxaly) x2 
y=1 y=H41 
a ae oe Be LV? G1) 
— 2 + 2 Ru 
n+1 2n 2 n 2 
An 
=-— E3 
mae (E3) 
[1] S P Luttrell, A Bayesian analysis of self-organising maps, Neural Computation 6 (1994), no. 5, 767-794.

[2] S P Luttrell, Partitioned mixture distribution: an adaptive Bayesian network for low-level image processing, IEE Proceedings on Vision, Image, and Signal Processing 141 (1994), no. 4, 251-260.
The Development of Dominance Stripes and Orientation Maps in a Self-Organising
Visual Cortex Network (VICON) *


S P Luttrell
Defence Research Agency, St Andrews Road, Malvern, Worcs, WR14 8PS, 
United Kingdom, tel: +44 (0) 1684 894046, fax: +44 (0) 1684 894384 


A self-organising neural network is presented that is based on a rigorous Bayesian analysis of 
the information contained in individual neural firing events. This leads to a visual cortex network 
(VICON) that has many of the properties that emerge when a mammalian visual cortex is exposed
to data arriving from two imaging sensors (i.e. the two retinae), such as dominance stripes and
orientation maps.


I. INTRODUCTION 


The overall goal of this work is to automate as far 
as is possible the processing of data from multiple sen- 
sors (data fusion), which includes the automatic design 
of the architecture and functionality of the network(s) 
that do the processing. In [10] a novel approach to this 
automation problem was introduced, and the purpose of 
this paper is to refine and extend the previously reported 
results. 

The problem of automating the design of a data fusion 
network has many interesting special case solutions. In 
particular, the type of self-organising neural network (in 
the mammalian visual cortex) that processes the images 
arriving from a pair of retinae is one such special case, 
where the number of sensors involved is just two. For a 
review of visual cortex neural network models see [2, 13]. 

The basic idea is to use a soft encoder (i.e. its output 
is a distributed code in which more than one, and pos- 
sibly all, of the output neurons is active) to transform 
the input vector (i.e. the input image) into a posterior 
probability over various possible classes (i.e. alternative 
possible interpretations of the input vector), and to op- 
timise the encoder so that this posterior probability is 
able to retain as much information as possible about the 
input vector, as measured in the minimum mean square 
reconstruction error (i.e. $L_2$ error) sense [8, 12].

In the special case where the optimisation is performed 
over the space of all possible soft encoders, the optimum 
solution is a hard encoder (i.e. it is a “winner-take-all” 
network in which only one of the output neurons is ac- 
tive) which is an optimal vector quantiser (VQ), of the 
type described in [4], for encoding the input vector with 
minimum $L_2$ error. In the slightly less special case where
the space of possible soft encoders is restricted to include
only those whose output is deliberately damaged by the
effects of a noise process, this produces a different type
of hard encoder which is an optimal self-organising map
(SOM) for encoding the input vector with minimum $L_2$
error; this is very closely related to the well-known Kohonen
map [3], as was demonstrated in [5].

*Typeset in LaTeX on May 9, 2019.


This paper will examine yet another special case, where 
the optimisation is performed over a very special sub- 
space of soft encoders, rather than over all possible soft 
encoders. The behaviour of each soft encoder is modelled 
by a set of posterior probabilities over various possible 
classes. When a particular parametric form for these pos- 
terior probabilities is chosen, a corresponding subspace 
of possible soft encoders is thus automatically selected, 
which may be explored by varying the parameters. The 
parametric form of the posterior probability that is used 
in this paper is based on the so-called partitioned mix- 
ture distribution (PMD) [7, 9], which is a natural gener- 
alisation of the standard mixture distribution to a high- 
dimensional input space. 


This use of a PMD leads to a 2-layer visual cortex 
network (VICON), where the components of the input 
vector are the output activities of the input neurons, and 
the components of the PMD posterior probability are the 
output activities of the output neurons. Various physi- 
cally realistic constraints are placed on the PMD opti- 
misation (both on the internal PMD structure, and on 
the type of training data that is used), and these will be 
described in the text as they arise. 


The layout of this paper is as follows. In section II all of
the necessary theoretical machinery is developed, including
folded Markov chains, posterior probability models,
derivatives of the objective function, and receptive fields.
In section III the concepts of dominance stripes and orientation
maps are explained, both in the context of the
elastic net model, and in the context of the theory presented
in this paper. In section IV the results of computer simulations
are presented, including both 1 and 2-dimensional
retinae, single and pairs of retinae, both for synthetic and 
natural training data. In appendix B some explicit op- 
timal solutions that minimise the objective function are 
derived, including the periodicity property of some types 
of optimal solution. 
Figure 1: (a) A Markov chain of transitions x → y → y' → x'.
(b) The same diagram as (a), but folded.


II. THEORY 


This section covers all of the basic theoretical machin- 
ery that is required to design and train a 2-layer VICON. 
In section II A the theory of folded Markov chains (FMC) 
is summarised. In section IIB the basic idea of a posterior 
probability model is introduced, and in section II C this 
is developed into a full partitioned posterior probability 
model. In section IID the derivatives of the FMC objec- 
tive function are derived assuming a partitioned poste- 
rior probability model, and in section IIE the influence of 
finite-sized receptive fields on these derivatives is derived. 


A. Folded Markov Chain 


The basis of the entire theoretical treatment is a com- 
munication channel model [6] in which an input vector x 
is encoded to produce a conditional probability Pr (y|x) 
over code indices y, which is then transmitted along a 
distorting communication channel to produce a conditional
probability Pr (y'|y) over distorted code indices y',
which is finally decoded to produce a conditional PDF
Pr (x'|y') over reconstructions x' of the original input
vector x. The three steps in the sequence x → y → y' → x'
are modelled by the conditional probabilities Pr (y|x),
Pr (y'|y), and Pr (x'|y'), which describe a Markov chain
of transitions, which is shown diagrammatically in figure
1(a). Pr (x'|y) is completely determined from other
defined quantities by using Bayes' theorem in the form

$$
\Pr(\mathbf{x}|y) = \frac{\Pr(\mathbf{x})\,\Pr(y|\mathbf{x})}{\int d\mathbf{x}'\,\Pr(\mathbf{x}')\,\Pr(y|\mathbf{x}')}
$$

Because x and x’ live in the same vector space it is 
convenient to fold this diagram to produce figure 1(b); 
this is called a folded Markov chain (FMC) [6]. Figure 
1(b) is directly related to a 2-layer unsupervised neural 
network, where x and x’ represent the activity pattern 
of the whole set of neurons in the input layer, and y 
and y’ represent the location(s) of winning neuron(s) in 
the output layer. The overall conditional PDF generated
by an FMC is Pr (x'|x), which is obtained by marginalising
y and y' in the joint probability Pr (x', y', y|x) =
Pr (x'|y') Pr (y'|y) Pr (y|x).

Define a network objective function D as [6] 


$$
D = \int d\mathbf{x}\,d\mathbf{x}'\,\Pr(\mathbf{x})\,\Pr(\mathbf{x}'|\mathbf{x})\,\left\|\mathbf{x}-\mathbf{x}'\right\|^2
\tag{2.1}
$$

which measures the expected Euclidean (or $L_2$) reconstruction
error caused by feeding input vectors sampled
from Pr (x) into the FMC, where each x is returned as a
PDF Pr (x'|x) of alternative reconstructions x' of x. For
simplicity, assume that the communication channel is
distortionless, so that Pr (y'|y) = $\delta_{y,y'}$,
and that y = 1, 2, ..., M; then

$$
D = \sum_{y=1}^{M}\int d\mathbf{x}\,d\mathbf{x}'\,\Pr(\mathbf{x})\,\Pr(y|\mathbf{x})\,\Pr(\mathbf{x}'|y)\,\left\|\mathbf{x}-\mathbf{x}'\right\|^2
\tag{2.2}
$$

An FMC is completely described by the form of its
encoder Pr (y|x) and the form of its reconstruction error
$\|\mathbf{x}-\mathbf{x}'\|^2$. The functional form of the encoder may
be chosen arbitrarily, and independently of the assumed
Euclidean form of the reconstruction error, so the FMC
does not in general correspond to a Gaussian mixture
distribution model in input space. This is a general result for FMCs in
which the functional forms of the encoder and the reconstruction
error may be independently chosen. It is only
when these functional forms are carefully chosen that a
density model interpretation of an FMC is possible (for
instance a Euclidean reconstruction error $\|\mathbf{x}-\mathbf{x}'\|^2$ must
be paired with an encoder Pr (y|x) that describes the posterior
probability over class labels that would arise in a
Gaussian mixture distribution model).

The expression for D given in equation 2.2 may be
simplified to yield [6] (this readily generalises to the case
where Pr (y'|y) ≠ $\delta_{y,y'}$, i.e. the communication channel
causes distortion)

$$
D = 2\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{M}\Pr(y|\mathbf{x})\,\left\|\mathbf{x}-\mathbf{x}'(y)\right\|^2
\tag{2.3}
$$

where x' (y) is a reference vector defined as

$$
\mathbf{x}'(y) \equiv \int d\mathbf{x}\,\Pr(\mathbf{x}|y)\,\mathbf{x}
\tag{2.4}
$$

If this definition of x' (y) is not used, and instead D in
equation 2.3 is minimised with respect to x' (y), then the
stationary solution is $\mathbf{x}'(y) = \int d\mathbf{x}\,\Pr(\mathbf{x}|y)\,\mathbf{x}$, which is
consistent with the definition in equation 2.4. In practice,
it is better to determine the stationary x' (y) by following
the gradient $\partial D/\partial\mathbf{x}'(y)$ than to use the explicit expression
$\int d\mathbf{x}\,\Pr(\mathbf{x}|y)\,\mathbf{x}$ for the stationary point, because $\partial D/\partial\mathbf{x}'(y)$
is cheap to evaluate whereas $\int d\mathbf{x}\,\Pr(\mathbf{x}|y)\,\mathbf{x}$ is expensive
to evaluate. In effect, the $\partial D/\partial\mathbf{x}'(y)$ approach is an example
of on-line training, whereas the $\int d\mathbf{x}\,\Pr(\mathbf{x}|y)\,\mathbf{x}$ approach
is the corresponding example of batch training, and the
on-line and batch approaches each have their own areas
where they are best used.
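
The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not taken from the paper: the fixed two-class soft encoder `posterior`, the 1-dimensional sample set, the step size `eps` and the number of sweeps are all hypothetical choices made only for illustration. The batch routine evaluates the centroid of equation 2.4 on a finite sample, and the on-line routine repeatedly steps along the negative gradient of equation 2.3, which is proportional to Pr (y|x) (x - x' (y)) for each sample.

```python
import math

def posterior(x):
    """Hypothetical fixed soft encoder Pr(y|x) over M = 2 code indices."""
    p0 = 1.0 / (1.0 + math.exp(4.0 * (x - 1.5)))
    return [p0, 1.0 - p0]

def batch_reference_vectors(xs):
    """Batch estimate: x'(y) = sum_x Pr(y|x) x / sum_x Pr(y|x) (equation 2.4 on a sample)."""
    num = [0.0, 0.0]
    den = [0.0, 0.0]
    for x in xs:
        p = posterior(x)
        for y in range(2):
            num[y] += p[y] * x
            den[y] += p[y]
    return [num[y] / den[y] for y in range(2)]

def online_reference_vectors(xs, eps=0.002, sweeps=5000):
    """On-line estimate: one small gradient step per sample, x'(y) += eps Pr(y|x) (x - x'(y))."""
    ref = [0.0, 0.0]
    for _ in range(sweeps):
        for x in xs:
            p = posterior(x)
            for y in range(2):
                ref[y] += eps * p[y] * (x - ref[y])
    return ref

samples = [0.0, 0.5, 1.0, 2.0, 2.5, 3.0]
batch = batch_reference_vectors(samples)
online = online_reference_vectors(samples)
```

With a small step size the on-line estimate settles close to the batch centroid, while each individual update touches only one sample.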
Figure 2: A neural network representation of a folded Markov
chain (diagram labels: Pr (y|x) and x' (y)).


In equation 2.3 Pr (y|x) is a “recognition model” (i.e. 
it takes an input vector and recognises it by assigning to
it a posterior probability over class labels) and x’ (y) 
is the corresponding “generative model” (i.e. it takes a 
class label and generates a corresponding vector in in- 
put space). This is a simpler type of generative model 
than appeared in the original expression for D in equa- 
tion 2.2, where the generative model is Pr (x’|y), which 
generates a whole distribution of possible vectors in in- 
put space, rather than just a single vector which is the 
centroid of Pr (x’|y). The transformation of the FMC 
from one that uses the PDF Pr (x’|y) into one that uses 
the reference vector x’ (y) is not possible in general; it 
was made possible here by choosing to use a Euclidean 
reconstruction error in D. In general, an FMC recon- 
struction is a distribution over alternative inputs, rather 
than a single representative input, as might be used in 
decision theory, for instance. 

The operation of the various terms in the expression
for D in equation 2.3 is shown in figure 2, which is rotated
through 90° anticlockwise with respect to the corresponding
diagram in figure 1, and also for simplicity y' = y
because Pr (y'|y) = $\delta_{y,y'}$ was assumed above. When D is
minimised with respect to the choice of encoder Pr (y|x)
and reconstruction vector x' (y) it yields a standard minimum
mean square error vector quantiser (VQ) with M
code indices [4], and if Pr (y'|y) ≠ $\delta_{y,y'}$ then the VQ produces
code indices that carry information in such a way
that it is maximally robust with respect to the damaging
effects of communication channel distortion modelled
by Pr (y'|y) [5]. This latter type of VQ can be shown
to be approximately equivalent to a self-organising map
(SOM) of the type introduced by Kohonen [3].


B. Basic Posterior Probability (Single Recognition 
Model) 


The minimisation procedure that leads to a VQ-like op- 
timum assumed that the entire space of posterior proba- 
bility functions Pr (y|x) was available to be searched. In 
the neural network interpretation, Pr (y|x) models the
probability that neuron y fires first (this encompasses 
both the case of a soft encoder where more than one 
neuron can potentially fire first, and the case of a hard 
encoder where only one neuron can potentially fire first; 
this is the winner-take-all case), which depends on the 
detailed underlying dynamics of how all of the neurons 
interact with each other. Because these neural dynam- 
ics are not arbitrary (e.g. they are constrained to be a 
physically realisable process), this constrains the space of
possible posterior probabilities Pr (y|x) that is available 
to the neural network. Pr (y|x) may then be modelled by 
the functional form 


$$
\Pr(y|\mathbf{x}) = \frac{Q(\mathbf{x}|y)}{\sum_{y'=1}^{M} Q(\mathbf{x}|y')}
\tag{2.5}
$$


where Q (x|y) is the raw “response function” of neuron y. 


Q (x|y) may be interpreted as the raw firing rate of 
neuron y, and Pr (y|x) is then the probability that neuron 
y fires first out of all of the M competing neurons. This
functional form makes it clear that there is a type of lateral
inhibition occurring between Pr (y1|x) and Pr (y2|x)
(for y1 ≠ y2), because if the raw firing rate Q (x|y1) is
increased so that Pr (y1|x) increases, then the
denominator $\sum_{y'=1}^{M} Q(\mathbf{x}|y')$ ensures that Pr (y2|x) decreases
(for y1 ≠ y2); i.e. the Q (x|y) do not exhibit
lateral inhibition, but the Pr (y|x) do exhibit lateral
inhibition.
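
A small numerical illustration of this lateral inhibition effect is sketched below in Python. The raw firing rates are made-up values, not data from the paper; the function simply normalises them as in equation 2.5.

```python
def posterior(raw_rates):
    """Pr(y|x) of equation 2.5: each raw rate Q(x|y) divided by the sum of all raw rates."""
    total = sum(raw_rates)
    return [q / total for q in raw_rates]

# Hypothetical raw firing rates Q(x|y) for M = 3 neurons.
q_before = [0.2, 0.5, 0.3]
# Only the raw rate of neuron 0 is raised; the other raw rates are untouched.
q_after = [0.8, 0.5, 0.3]

p_before = posterior(q_before)
p_after = posterior(q_after)
```

Although the raw rates of neurons 1 and 2 are unchanged, their posterior probabilities fall when neuron 0 becomes more active, because all three share the same denominator.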

The raw receptive field of a neuron depends on the 
form of Q (x|y). Thus if the functional form of Q (x|y) 
depends only on a subset X (y) of components of x, then 
X (y) is the raw receptive field of neuron y. However, this 
is not the same as the receptive field that is effective
in producing the first firing event, because Pr (y|x) depends
on all of the X (y') (for y' = 1, ..., M) as shown
in equation 2.5.

The effect of the distortion process, as modelled
by Pr (y|y'), is to alter at the last minute, as it
were, the probability that each neuron fires first. Thus
the posterior probability is modified as follows

$$
\Pr(y|\mathbf{x}) \longrightarrow \sum_{y'=1}^{M}\Pr(y|y')\,\Pr(y'|\mathbf{x})
\tag{2.6}
$$


where the matrix element Pr (y|y') leaks posterior probability
from neuron y' onto neuron y. Such cross-talk
amongst the neurons exists independently of the lateral
inhibition effect produced by the denominator term
$\sum_{y'=1}^{M} Q(\mathbf{x}|y')$ in equation 2.5.
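
The leakage step of equation 2.6 amounts to multiplying the posterior by a matrix whose columns each sum to one. The Python sketch below uses a hypothetical 4-neuron leakage matrix and a posterior concentrated on a single neuron; both are illustrative values only.

```python
def leak(posterior, leak_matrix):
    """Apply equation 2.6: leak_matrix[y][y_prime] holds Pr(y|y'), with unit column sums."""
    M = len(posterior)
    return [sum(leak_matrix[y][yp] * posterior[yp] for yp in range(M)) for y in range(M)]

# Hypothetical posterior in which neuron 1 is certain to fire first.
posterior = [0.0, 1.0, 0.0, 0.0]

# Hypothetical nearest-neighbour leakage; every column sums to 1.
leak_matrix = [
    [0.50, 0.25, 0.00, 0.00],
    [0.50, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.50],
    [0.00, 0.00, 0.25, 0.50],
]

leaked = leak(posterior, leak_matrix)
```

Half of the probability stays on neuron 1 and a quarter leaks onto each of its neighbours, while the total probability remains one.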

The VQ and SOM results (see [3, 4]) may be obtained 
as special cases of raw neuron firing rates Q (x|y), where 
one neuron’s firing rate is much larger than the other 
M — 1 neurons’ firing rates (i.e. there is effectively only 
one neuron that can fire, so it is the winner-take-all). 
C. Partitioned Posterior Probability (Multiple
Recognition Models) 


The form of the posterior probability Pr (y|x) introduced
in equation 2.5 is unsuitable for networks with a
large number of neurons M, because the lateral inhibition
is global rather than local. This can readily be inferred
because the denominator term $\sum_{y'=1}^{M} Q(\mathbf{x}|y')$ in equation
2.5 computes a quantity that is the sum over all of
the raw neuron firing rates.

This problem can be amended by defining a localised
posterior probability Pr (y|x;y') as

$$
\Pr(y|\mathbf{x};y') \equiv \frac{Q(\mathbf{x}|y)\,\delta_{y\in N(y')}}{\sum_{y''\in N(y')} Q(\mathbf{x}|y'')}
\tag{2.7}
$$

where N (y') is the local neighbourhood of neuron y',
which is assumed to contain at least neuron y', and
$\delta_{y\in N(y')}$ is a Kronecker delta that constrains y to lie in
the neighbourhood N (y'). If N (y') contains all M neurons
then Pr (y|x;y') reduces to Pr (y|x) as previously
defined in equation 2.5. Pr (y|x;y') has the required normalisation
property that $\sum_{y=1}^{M}\Pr(y|\mathbf{x};y') = 1$ for all y'.
Because y' can take M possible values, there are M complete
localised posterior probability functions Pr (y|x;y').
In effect, the neural network is split up into M overlapping
subnetworks (these subnetworks overlap where
$N(y_1)\cap N(y_2)\neq\emptyset$ for $y_1\neq y_2$), each of which computes
its own posterior probability function; note that
any overlap between a pair of subnetworks causes the
corresponding Pr (y|x;y') to be mutually dependent.
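
The localised posterior of equation 2.7 can be sketched in a few lines of Python. The 1-dimensional neighbourhood shape (all neurons within `half_width` of the centre, clipped at the ends of the array) and the raw firing rates are assumptions made for this illustration; the paper does not prescribe them here.

```python
def neighbourhood(centre, half_width, M):
    """Hypothetical 1-D neighbourhood N(y'): indices within half_width of centre, clipped at the ends."""
    return [y for y in range(M) if abs(y - centre) <= half_width]

def local_posterior(q, centre, half_width):
    """Pr(y|x;y') of equation 2.7 for every y; the result is zero outside N(y')."""
    M = len(q)
    members = neighbourhood(centre, half_width, M)
    norm = sum(q[y] for y in members)
    return [q[y] / norm if y in members else 0.0 for y in range(M)]

# Hypothetical raw firing rates Q(x|y) for M = 5 neurons.
q = [0.1, 0.4, 0.2, 0.7, 0.3]

# One localised posterior per choice of y', each normalised over its own neighbourhood.
local = [local_posterior(q, centre, 1) for centre in range(len(q))]

# A neighbourhood that covers all neurons recovers the global form of equation 2.5.
global_posterior = local_posterior(q, 2, len(q))
```

Each row of `local` sums to one on its own, which is the normalisation property stated above.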

It is not always convenient to use a neural network 
model in which there are M separate posterior probabil- 
ity models Pr (y|x;y’). However, these M localised poste- 
rior probability functions Pr (y|x;y’) (for the M different 
choices of y’) may be averaged together to produce a sin- 
gle posterior probability function. Thus define Pr (y|x) 
as 


$$
\begin{aligned}
\Pr(y|\mathbf{x}) &\equiv \frac{1}{M}\sum_{y'\in N^{-1}(y)}\Pr(y|\mathbf{x};y') \\
&= \frac{1}{M}\,Q(\mathbf{x}|y)\sum_{y'\in N^{-1}(y)}\frac{1}{\sum_{y''\in N(y')} Q(\mathbf{x}|y'')}
\end{aligned}
\tag{2.8}
$$

where $N^{-1}(y)$ is the inverse neighbourhood of neuron y
defined as $N^{-1}(y) \equiv \{y' : y\in N(y')\}$. This definition has
all of the properties of a posterior probability function,
including the normalisation property $\sum_{y=1}^{M}\Pr(y|\mathbf{x}) = 1$
(which may be derived by swapping the order of
summations using the result $\sum_{y=1}^{M}\sum_{y'\in N^{-1}(y)}(\cdots) =
\sum_{y'=1}^{M}\sum_{y\in N(y')}(\cdots)$). The form of the posterior probability
given in equation 2.8, in which M individual posterior
probabilities Pr (y|x;y') are averaged together, can
be rigorously justified from a Bayesian point of view (see
appendix A). The averaging process produces the posterior
probability that should be used when there are M
contributing models (as specified by the Pr (y|x;y') for
y' = 1, 2, ..., M) that have equal prior weight. The average
over the M models then simply marginalises over
an unobserved degree of freedom (the model index y').

Figure 3: A partitioned mixture distribution (PMD) neural
network.
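
The averaged posterior of equation 2.8 and its normalisation property can be checked numerically. The Python sketch below is illustrative only: the clipped 1-dimensional neighbourhood and the raw firing rates are assumed, and the loop over the model index y' accumulates each localised posterior with weight 1/M.

```python
def neighbourhood(centre, half_width, M):
    """Hypothetical 1-D neighbourhood N(y'): indices within half_width of centre, clipped at the ends."""
    return [y for y in range(M) if abs(y - centre) <= half_width]

def pmd_posterior(q, half_width):
    """Pr(y|x) of equation 2.8: the equal-weight average of the M localised posteriors."""
    M = len(q)
    post = [0.0] * M
    for centre in range(M):  # model index y'
        members = neighbourhood(centre, half_width, M)
        norm = sum(q[y] for y in members)
        for y in members:  # y in N(y') is the same condition as y' in the inverse neighbourhood of y
            post[y] += q[y] / (M * norm)
    return post

# Hypothetical raw firing rates Q(x|y) for M = 6 neurons.
q = [0.1, 0.4, 0.2, 0.7, 0.3, 0.9]
post = pmd_posterior(q, 1)
total = sum(post)
```

Looping over y' and then over y in N (y') is exactly the swapped order of summation used in the normalisation argument, so the total comes out as one regardless of the raw rates.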

If this localised definition of Pr (y|x) given in equation 
2.8 is compared with the global definition given in equa- 
tion 2.5 it is seen that the normalisation factor has been 
modified thus 


$$
\frac{1}{\sum_{y'=1}^{M} Q(\mathbf{x}|y')} \longrightarrow \frac{1}{M}\sum_{y'\in N^{-1}(y)}\frac{1}{\sum_{y''\in N(y')} Q(\mathbf{x}|y'')}
\tag{2.9}
$$

which alters its lateral inhibition properties.
$1/\sum_{y''\in N(y')} Q(\mathbf{x}|y'')$ is the lateral inhibition factor that
derives from the neighbourhood of neuron y', which
gives rise to a contribution to the lateral inhibition
factor for all neurons y in the neighbourhood of y' via
the average $\frac{1}{M}\sum_{y'\in N^{-1}(y)}(\cdots)$. Thus the overall lateral
inhibition factor acting on neuron y is derived locally
from those neurons y'' that lie in the set $N(N^{-1}(y))$.

The posterior probability model defined in equation
2.8 has been used before in the context of partitioned
mixture distributions (PMDs), where multiple mixture
distribution models are simultaneously optimised [7, 9].
Figure 3 shows the structure of the neural network corresponding
to the PMD posterior probability in equation
2.8. Each output neuron has a raw receptive field of input
neurons (which contains 5 input neurons in the example
shown), and is also laterally inhibited by its neighbouring
output neurons (the size of a neuron neighbourhood is 3
neurons to either side in the example shown). Note that
the input-output links in figure 3 do not imply that the
raw neuron firing rates Q (x|y) can be computed by using
simple weighted connections; they are drawn merely to
indicate the set of input neurons that influences the raw
firing rate of each output neuron. Similarly, the output-output
links in figure 3 are drawn to indicate the sizes
of the output neuron neighbourhoods; the details of how
lateral inhibition modifies the raw firing rates Q (x|y) of
the output neurons to produce the probability Pr (y|x)
that neuron y fires first are given in equation 2.8.

For completeness, the PMD objective function in equation
2.3 may now be written out in full using the expression
for the PMD posterior probability in equation 2.8 to
yield (where the effects of leakage have been included, as
defined in equation 2.6)

$$
D = \frac{2}{M}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{M}\sum_{y'=1}^{M}\Pr(y|y')\,Q(\mathbf{x}|y')\sum_{y''\in N^{-1}(y')}\frac{1}{\sum_{y'''\in N(y'')} Q(\mathbf{x}|y''')}\,\left\|\mathbf{x}-\mathbf{x}'(y)\right\|^2
\tag{2.10}
$$

This is the objective function that will be used to characterise the performance of the neural networks in all of the
computer simulations.


D. Derivatives of the Objective Function 


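In order to minimise the PMD objective function in equation 2.10 its derivatives must be calculated. First of all,
define some convenient notation [12]

$$
\begin{aligned}
L_{y,y'} &\equiv \Pr(y'|y) & P_{y,y'} &\equiv \Pr(y'|\mathbf{x};y) \\
p_y &\equiv \sum_{y'\in N^{-1}(y)} P_{y',y} & (L^T p)_y &\equiv \sum_{y'\in\mathcal{L}^{-1}(y)} L_{y',y}\,p_{y'} \\
e_y &\equiv \left\|\mathbf{x}-\mathbf{x}'(y)\right\|^2 & (Le)_y &\equiv \sum_{y'\in\mathcal{L}(y)} L_{y,y'}\,e_{y'} \\
(PLe)_y &\equiv \sum_{y'\in N(y)} P_{y,y'}\,(Le)_{y'} & (P^T PLe)_y &\equiv \sum_{y'\in N^{-1}(y)} P_{y',y}\,(PLe)_{y'}
\end{aligned}
\tag{2.11}
$$

where $\mathcal{L}(y)$ denotes the leakage neighbourhood of neuron y, which is the set of neurons that have posterior probability
leaked onto them by neuron y, and the inverse leakage neighbourhood $\mathcal{L}^{-1}(y)$ is defined as $\mathcal{L}^{-1}(y) \equiv \{y' : y\in\mathcal{L}(y')\}$.

Assume that the raw neuron firing rates may be modelled using a sigmoid function

$$
Q(\mathbf{x}|y) = \frac{1}{1+\exp\left(-\mathbf{w}(y)\cdot\mathbf{x}-b(y)\right)}
\tag{2.12}
$$

whence the derivatives may be obtained in the form [12]

$$
\begin{aligned}
\frac{\partial D}{\partial\mathbf{x}'(y)} &= -\frac{4}{M}\int d\mathbf{x}\,\Pr(\mathbf{x})\,(L^T p)_y\,\left(\mathbf{x}-\mathbf{x}'(y)\right) \\
\frac{\partial D}{\partial\begin{pmatrix} b(y) \\ \mathbf{w}(y)\end{pmatrix}} &= \frac{2}{M}\int d\mathbf{x}\,\Pr(\mathbf{x})\left(p_y\,(Le)_y-(P^T PLe)_y\right)\left(1-Q(\mathbf{x}|y)\right)\begin{pmatrix} 1 \\ \mathbf{x}\end{pmatrix}
\end{aligned}
\tag{2.13}
$$

where the two derivatives $\partial D/\partial b(y)$ and $\partial D/\partial\mathbf{w}(y)$ have been written together for compactness.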
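
The factor (1 - Q (x|y)) multiplying the vector (1, x) in the second derivative is the gradient of log Q (x|y) with respect to (b (y), w (y)) for the sigmoid of equation 2.12. The Python sketch below checks this one ingredient against central finite differences; the weight vector, bias and input are arbitrary illustrative values, and the helper names are hypothetical.

```python
import math

def raw_rate(w, b, x):
    """Sigmoid raw firing rate Q(x|y) of equation 2.12 for one neuron with weights w and bias b."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-activation))

def dlogq_analytic(w, b, x):
    """Gradient of log Q with respect to (b, w): (1 - Q) times the vector (1, x)."""
    q = raw_rate(w, b, x)
    return [1.0 - q] + [(1.0 - q) * xi for xi in x]

def dlogq_numeric(w, b, x, h=1e-6):
    """Central finite-difference estimate of the same gradient, used only as a check."""
    grads = [(math.log(raw_rate(w, b + h, x)) - math.log(raw_rate(w, b - h, x))) / (2.0 * h)]
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grads.append((math.log(raw_rate(w_plus, b, x)) - math.log(raw_rate(w_minus, b, x))) / (2.0 * h))
    return grads

# Arbitrary illustrative parameters and input.
w, b, x = [0.5, -1.0], 0.2, [1.5, 0.3]
analytic = dlogq_analytic(w, b, x)
numeric = dlogq_numeric(w, b, x)
```

The two gradients agree to within the accuracy of the finite-difference estimate.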


E. Receptive Fields 


The raw firing rate Q (x|y) of neuron y depends only 
on a subset x(y) of components of x; x(y) is thus the 
raw receptive field of neuron y. However, the posterior 
probability Pr(y|x) that neuron y fires first is derived 
from Q (x|y) by weighting it with a lateral inhibition fac- 
tor that depends on the raw firing rates of all neurons 
in $N(N^{-1}(y))$, as seen in equation 2.8, so the overall
receptive field of a neuron is rather broader than its raw
receptive field. The effect of leakage, as defined in equation
2.6, is to broaden the overall receptive field further
still. The optimal reference vector x' (y) has non-trivial
structure only within this overall receptive field, so inside
the overall receptive field the components of x' (y) must
be subjected to an optimisation procedure to discover
their optimal form, whereas outside the overall receptive
field the components of x' (y) may be set to be the average
values of the corresponding components of the training
vectors x (see the definition of x' (y) in equation 2.4,
which reduces to $\mathbf{x}'(y) = \int d\mathbf{x}\,\Pr(\mathbf{x})\,\mathbf{x}$ for those components
of x' (y) that lie outside the overall receptive field
of neuron y).


In the simulations that will be presented here a suboptimal
approach is used, where only those components
of x’ (y) that lie inside the raw receptive field are opti- 
mised; this produces a least upper bound on the value 
of the objective function that would have been obtained 
if a full optimisation had been used. Also, it is assumed 
that the input data has been prepared in such a way that 
each component is zero mean. This is not actually a re- 
striction, because the objective function is invariant with 
respect to adding a different constant to each component 
of x, because it is a function of the difference x — x’. In 
this suboptimal approach, and with the zero mean as- 
sumption, the components of x’ (y) that lie outside the 
raw receptive field of neuron y will be set to zero. 

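The fact that the components of x' (y) that lie outside
the raw receptive field of neuron y are zero may be
used to simplify the evaluation of the various terms in
$\partial D/\partial b(y)$ and $\partial D/\partial\mathbf{w}(y)$ in equation 2.13. Thus evaluate
$p_y\,(Le)_y - (P^T PLe)_y$ by expanding $e_y$ as

$$
\begin{aligned}
e_y &= \left\|\mathbf{x}\right\|^2 - 2\,\mathbf{x}\cdot\mathbf{x}'(y) + \left\|\mathbf{x}'(y)\right\|^2 \\
&= \left\|\mathbf{x}\right\|^2 + \mathbf{x}'(y)\cdot\left(\mathbf{x}'(y)-2\mathbf{x}\right)
\end{aligned}
\tag{2.14}
$$

which is a sum of a constant (i.e. does not depend on
y) term $\|\mathbf{x}\|^2$ and a term $\mathbf{x}'(y)\cdot(\mathbf{x}'(y)-2\mathbf{x})$ that does
depend on y. What happens to the constant term when
it is substituted into $p_y\,(Le)_y - (P^T PLe)_y$?

$$
\begin{aligned}
p_y\,(Le)_y - (P^T PLe)_y &\longrightarrow p_y\,(L\mathbf{1})_y - (P^T PL\mathbf{1})_y \\
&= p_y\,\mathbf{1}_y - (P^T P\mathbf{1})_y \\
&= p_y - p_y \\
&= 0
\end{aligned}
\tag{2.15}
$$

It cancels out, so $e_y$ might as well be replaced as follows
in $p_y\,(Le)_y - (P^T PLe)_y$

$$
e_y \longrightarrow \mathbf{x}'(y)\cdot\left(\mathbf{x}'(y)-2\mathbf{x}\right)
\tag{2.16}
$$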


Because the components of x’ (y) that lie outside the raw 
receptive field of neuron y are set to zero, the $x'(y) \cdot (\cdots)$
operation effectively projects out any components of $(\cdots)$
that happen to lie outside this raw receptive field. This 
means that the only components of x in equation 2.16 
that survive are those that lie inside the raw receptive 
field, so effectively ey depends only on quantities that lie 
inside the raw receptive field of neuron y. Note that a 
full optimisation of x’ (y), in which all components that 
lie inside the overall receptive field of neuron y are opti- 
mised, would produce a different result. 
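
The cancellation and projection argument above can be checked numerically. The following is a minimal sketch in plain Python, using invented vectors; the names `x_ref`, `x_a`, `x_b` and `y_dependent_term` are illustrative assumptions and do not appear in the paper. It checks that the squared error minus the constant term equals the y-dependent term of equation 2.16, and that this term ignores components of x outside the raw receptive field when x’ (y) is zero there.

```python
# Illustrative check only: the vectors below are invented for the example.

def dot(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def y_dependent_term(x, x_ref):
    """x'(y) . (x'(y) - 2x), the part of e_y that depends on y."""
    return dot(x_ref, [r - 2.0 * xi for r, xi in zip(x_ref, x)])

# Components 1..3 are inside the raw receptive field; x_ref is zero elsewhere.
x_ref = [0.0, 0.4, -0.2, 0.1, 0.0]
x_a = [0.7, 0.3, -0.5, 0.2, -0.9]
x_b = [-1.3, 0.3, -0.5, 0.2, 2.5]  # equal to x_a inside the field, different outside

# Full squared error e_y = ||x - x'(y)||^2 for the first input.
e_a = sum((xi - r) ** 2 for xi, r in zip(x_a, x_ref))
```

Here `e_a - dot(x_a, x_a)` agrees with `y_dependent_term(x_a, x_ref)`, and `x_a` and `x_b` give the same y-dependent term even though they differ outside the receptive field.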


III. DOMINANCE STRIPES AND 
ORIENTATION MAPS 


The purpose of this section is to discuss the two phe- 
nomena of dominance stripes and orientation maps. In 
section III A a brief review of the popular elastic net
model of dominance stripes is presented, and in section
III B an informal derivation of the origin of both dominance
stripes and orientation maps is given.


[Figure 4 diagram; surviving labels: "Right Eye", "Left Eye", "2d".]


Figure 4: An elastic net oscillating back and forth in ocularity
between a pair of retinae.


A. Review of Dominance Stripes Using the Elastic 
Net Model 


The results that will be presented here are, broadly 
speaking, equivalent to the way in which ocular dom- 
inance stripes are obtained in the elastic net model (as 
reviewed in [2, 13]) as applied to a pair of retinae. The es- 
sential features of this type of model of ocular dominance 
are shown in figure 4 (which is copied from [2]). The left 
and right retinae are represented as 1-dimensional lines 
of units at the top and bottom of the diagram. The hori- 
zontal dimension represents distance across a retina, and 
the vertical dimension represents the ocularity degree of 
freedom. The distance between any two retinal units, ei- 
ther within or between retinae, represents the correlation 
between those two units [2]. Thus the ratio of the inter-retinal separation to the intra-retinal unit spacing determines
the relative strength of the inter-retinal and intra-retinal 
correlations. The elastic net is represented by the line 
oscillating back and forth between the retinae. The net 
effect of the elastic net algorithm is to encourage the 
elastic net to pass as close as possible (in a well-defined 
sense) to all of the retinal units, and also to minimise 
its total length. These are conflicting requirements, and 
the oscillatory solution shown in figure 4 is typical of an 
optimal elastic net configuration, which thus predicts an 
oscillatory pattern of ocular dominance (i.e. which corre- 
sponds to dominance stripes in the case of 2-dimensional 
retinae). 


This type of model inevitably leads to dominance 
stripe formation, because the elastic net model separates 
the input components into two clusters (see figure 4) ac- 
cording to whether they belong to the left or right retina. 
In effect the output layer of the network is explicitly told 
which retina an input component belongs to, and this 
fact is expressed by the position of the component along 
the ocularity dimension. The goal in this paper is to 
construct a more natural model of dominance stripe for- 
mation, in which the ocularity dimension is revealed by a 
process of self-organisation, rather than being hard-wired
into the model. Thus, the visual cortex model that is
presented in this paper will not explicitly label the input
pixels as belonging to the right or left retina (as they
are in figure 4), but will have to deduce their left/right
retina membership from the properties of the training set
instead.


Figure 5: Neural network model with a limited receptive field.


B. Informal Derivation of Dominance Stripes and 
Orientation Maps 


The purpose of this section is to present a simple pic- 
ture that makes it clear what types of behaviour should 
be expected from a neural network that minimises the
objective function in equation 2.10.


1. Neural Network Model 


It is assumed that each of the output neurons has only 
a limited receptive field of input neurons within each of 
the two retinae. In effect, this is a hand-crafted version 
of a “wire length” constraint, which ensures that the total 
length of the input-to-output connections is limited. In 
the context of the elastic net model this corresponds to 
the limited range of interaction between retinal units (the 
input) and elastic net units (the output). Also, it is as- 
sumed that sigmoidal neurons with local probability leak- 
age are used, which generates an effect that is analogous 
to the elastic tension in the elastic net model, because 
it encourages neighbouring neurons to adopt similar pa- 
rameter values. 

This model is drawn in figure 5 in an analogous way to 
the elastic net model in figure 4. In this model the ocular- 
ity dimension is not explicitly present, and the elasticity 
(of the elastic net) is replaced by the probability leakage 
mechanism that enables neighbouring output neurons to
communicate with each other. The separation of input 
neurons into left and right retinae in figure 5 is made 
only for comparison between figure 5 and the elastic net 
model in figure 4. When the left and right receptive 
fields are presented to the output neuron, all informa- 
tion about which retina the various input neurons belong 
to has been discarded; all input neurons within the left 
and right receptive fields are treated on an equal basis. 
The ocularity dimension will emerge by a process of self- 
organisation driven by the statistical properties of the 
images received by the left and right retinae. 


2. Very Low Resolution Input Images 


The simplest situation is when there are two retinae 
(as in the above elastic net model), each of which senses 
independently a featureless scene, i.e. all the units in 
a retina sense the same brightness value, but the two 
brightnesses that the left and right retinae sense are in- 
dependent of each other. This situation would arise if 
the images projected onto the two retinae were very low 
resolution, so all spatial detail is lost. This limits the
input data to lying in a 2-dimensional space $R^2$. If these
two featureless input images (i.e. left and right retinae)
are then normalised so that the sum of left and right
retina brightness is constrained to be constant, then the
input data is projected down onto a 1-dimensional space
$R^1$, which effectively becomes the ocularity dimension.
If each of the M output neurons had an infinite-sized
receptive field, then the optimal network would be the one
in which the M neurons cooperate to give the best soft
encoding of $R^2$.
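
The normalisation step can be illustrated with a minimal sketch in plain Python. It assumes positive, uniformly distributed brightness values; the sampling range and the names `normalise`, `pairs` and `ocularity` are assumptions made for illustration and are not taken from the paper.

```python
import random

def normalise(left, right):
    """Scale a (left, right) brightness pair so that left + right = 1."""
    total = left + right
    return left / total, right / total

random.seed(0)
# Independent brightnesses for the two featureless retinae (assumed positive).
pairs = [(random.uniform(0.1, 1.0), random.uniform(0.1, 1.0)) for _ in range(5)]
normalised = [normalise(l, r) for l, r in pairs]

# Each normalised point lies on the segment joining (0, 1) and (1, 0), so its
# first coordinate can serve as a 1-dimensional ocularity value.
ocularity = [l for l, _ in normalised]
```

After normalisation each input is described by a single number between 0 and 1, which is the sense in which the 2-dimensional input space collapses onto a 1-dimensional ocularity dimension.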

However, because of the limited receptive field size and 
output neuron neighbourhood size, the neurons can at 
best co-operate together a few at a time (this also de- 
pends on the size of the leakage neighbourhood). If the 
network properties are translation invariant this leads to 
an optimal network whose properties fluctuate periodi- 
cally across the network (see appendix B), where each pe- 
riod typically contains a complete repertoire of the com- 
puting machinery that is needed to process the contents 
of a receptive field; this effect is called completeness, and 
it is a characteristic emergent property of this type of 
neural network. 

The only unexplained step in this argument is the use 
of a normalisation procedure on the input. However, if 
the input to this network is the PMD posterior proba- 
bility computed by the output layer of another such net- 
work, then there is already such a normalisation effect in- 
duced by the lateral inhibition within the PMD posterior 
probability. For featureless input images, this lateral in- 
hibition effect causes precisely the type of normalisation 
that is used above (i.e. left plus right retina brightness 
is constant) to occur naturally. 

These results are summarised in figure 6 where the oc- 
ularity dimension runs from (0,1) to (1,0), and a typical 
set of neural reference vectors is shown. The oscillation
of these reference vectors back and forth along the
ocularity dimension corresponds to the oscillations of the
elastic net that are represented in figure 4.


Figure 6: Typical neural reference vectors for very low
resolution input images.


3. Low Resolution Input Images


A natural generalisation of the above is to the case 
of not-quite-featureless input images. This could be 
brought about by gradually increasing the resolution of 
the input images until it is sufficient to reveal spatial 
detail on a size scale equal to the receptive field size. In- 
stead of seeing a featureless input, each neuron would 
then see a brightness gradient within its receptive field. 
This could be interpreted by considering the low order 
terms of a Taylor expansion of the input image about a 
point at the centre of the neuron’s receptive field: the
zeroth term is local average brightness (which lives on a
1-dimensional line $R^1$), and the two first order terms are the
local brightness gradient (which lives in a 2-dimensional
space $R^2$). When normalisation is applied this reduces
the space in which the two images live to $R^1 \times R^2 \times R^2$
($R^1$ from the zeroth order Taylor term with normalisation
taken into account, $R^2$ from the first order Taylor
terms, counted twice to deal with each retina).
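
To make the Taylor expansion picture concrete, the following minimal sketch estimates the zeroth order term (local average brightness) and the two first order terms (local brightness gradient) of a square image patch. It is an illustration only: the least-squares plane fit, the function name `taylor_terms` and the test patch are assumptions, not the procedure used in the paper.

```python
def taylor_terms(patch):
    """Return (mean brightness, d/dx, d/dy) for a square patch of odd side length.

    Coordinates are measured from the patch centre, so a least-squares plane
    fit decouples into three independent sums.
    """
    n = len(patch)
    half = n // 2
    coords = list(range(-half, half + 1))
    mean = sum(sum(row) for row in patch) / (n * n)
    norm = n * sum(c * c for c in coords)  # sum of x^2 over the whole patch
    # patch[row][col]: the row index is the y coordinate, the column index is x.
    grad_x = sum(coords[col] * patch[row][col] for row in range(n) for col in range(n)) / norm
    grad_y = sum(coords[row] * patch[row][col] for row in range(n) for col in range(n)) / norm
    return mean, grad_x, grad_y

# A planar test patch: brightness = 2 + 0.5*x - 0.25*y on a 5 x 5 grid.
patch = [[2.0 + 0.5 * x - 0.25 * y for x in range(-2, 3)] for y in range(-2, 3)]
mean, grad_x, grad_y = taylor_terms(patch)
```

For an exactly planar patch the three recovered numbers are the plane's offset and slopes; for a real image patch they are the best planar approximation, with the higher order detail discarded.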

The $R^1$ from the zeroth order Taylor term gives rise to
ocular dominance stripes, which thus causes the left and
right retinae to map to different stripe-shaped regions
of the output layer. The remaining $R^2 \times R^2$ then naturally
splits into two contributions (left retina and right
retina), each of which maps to the appropriate stripe. If
the stripes did not separate the left and right retinae,
then the $R^2 \times R^2$ could not be split apart in this simple
manner. Finally, since each ocular dominance stripe
occupies a 2-dimensional region of the output layer, a direct
mapping of the corresponding $R^2$ (which carries local
brightness gradient information) to output space can be
made. As in the case of dominance stripes alone, the
limited receptive field size and output neuron neighbourhood
size cause the neurons to co-operate together only
a few at a time, so that each local patch of neurons
contains a complete mapping from $R^2$ to the 2-dimensional
output layer.


Figure 7: Typical neural reference vectors for low resolution
input images.

These results are summarised in figure 7 where the pure 
oscillation back and forth along the ocularity dimension 
that occurred in figure 6 develops to reveal some addi- 
tional degrees of freedom, only one of which is represented 
in figure 7 (it is perpendicular to the ocularity axis). 

If the leakage is reduced then the oscillation back and 
forth along the dominance axis tends to be more like a 
square wave than a sine wave, in which case figure 7 be- 
comes as shown in figure 8 where the neural reference 
vectors are bunched near to the points (0,1) and (1,0), 
and explore the additional degree(s) of freedom at each 
end of the ocularity axis. In the extreme case, where 
the ocularity switches back and forth as a square wave, 
the neurons separate into two clusters, one of which re- 
sponds only to the left retina’s image and the other to the 
right retina’s image. Furthermore, within each of these 
clusters, the neurons explore the additional degree(s) of
freedom that occur within the corresponding retina’s image.
Note that only one such degree of freedom is represented
in figure 8; it is perpendicular to the ocularity axis.

The above arguments can be generalised to the case 
of input images with fine spatial structure (i.e. lots of 
high order terms in the Taylor expansion are required). 
However, more and more neurons (per receptive field) are 
required in order to build a faithful mapping from input
space to a 2-dimensional representation in output space.
For a given number of neurons (per receptive field) a
saturation point will quickly be reached, where the least
important detail (from the point of view of the objective
function) is discarded, keeping only those properties
of the input images that best preserve the ability of the
neural network to reconstruct its own input with minimum
Euclidean error (on average).


Figure 8: Typical neural reference vectors for low resolution
input images, where reduced leakage causes the ocularity to
switch abruptly back and forth.


IV. SIMULATIONS 


Two types of training data will be used: synthetic, 
and natural. Synthetic data is used in order to demon- 
strate simple properties of the neural network, without 
introducing extraneous detail to complicate the interpre- 
tation of the results. Natural data is used to remove any 
doubt that the neural network is capable of producing in- 
teresting and useful results when it encounters data that 
is more representative of what it might encounter in the 
real world. 


In section IV A dominance stripes are produced from 
a 1-dimensional retina, and in section IV B these results 
are generalised to a 2-dimensional retina. In both cases 
both synthetic and natural image results are shown. In 
section IV C orientation maps are produced for the case 
of two retinae trained with natural images. 


A. Dominance Stripes: The 1-Dimensional Case


The purpose of the simulations that are presented in 
this section is to demonstrate the emergence of ocular 
dominance stripes in the simplest possible realistic case. 
The results will correspond to the situation outlined in 
figure 6. 


1. Synthetic Training Data 


The purpose of this simulation is to demonstrate the 
emergence of ocular dominance stripes, of the type that 
were shown in figure 6, by presenting a model of the 
type shown in figure 5 with very low-resolution input 
images. In fact, the resolution is so low that each image 
is entirely featureless, so that all the neurons in a retina 
have the same input brightness, but the two retinae have 
independent input brightnesses. These input images are 
normalised by processing them so that they look like the 
PMD posterior probability computed by the output layer 
of another such network; the neighbourhood size used for 
this normalisation process was chosen to be the same as 
the network’s own output layer neighbourhood size. 

In the first simulation the parameters used were: net- 
work size = 30, receptive field size = 9, output layer 
neighbourhood size = 5 (centred on the source neuron), 
leakage neighbourhood size = 5 (centred on the source 
neuron), number of training updates = 2000, update step 
size = 0.01. For each neuron the leakage probability had
a Gaussian profile centred on the neuron, and the standard
deviation was chosen as 1, to make the profile fall
from 1 on the source neuron to $\exp(-\frac{1}{2})$ on each of its two
closest neighbours.
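
As a minimal sketch of such a leakage profile, assuming the usual Gaussian form $\exp(-d^2 / 2\sigma^2)$ for a neuron at distance d from the source neuron, the weights over a neighbourhood of size 5 can be computed as follows. The function name `leakage_profile` is hypothetical, and whether the profile is subsequently renormalised to sum to 1 is not stated in this passage, so it is left unnormalised here.

```python
import math

def leakage_profile(neighbourhood_size, sigma):
    """Unnormalised Gaussian leakage weights centred on the source neuron."""
    half = neighbourhood_size // 2
    return [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in range(-half, half + 1)]

profile = leakage_profile(neighbourhood_size=5, sigma=1.0)
# profile[2] is the source neuron; profile[1] and profile[3] are its two
# closest neighbours, and profile[0] and profile[4] are the next closest.
```

With a standard deviation of 1 the weight on each closest neighbour is exp(-1/2) times the weight on the source neuron; halving the standard deviation to 0.5 would reduce that factor to exp(-2).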

The update scheme used was a crude gradient follow- 
ing algorithm parameterised by three numbers which con- 
trolled the rate at which the weight vectors, biasses and 
reference vectors were updated. These three numbers 
were continuously adjusted to ensure that the maximum 
rate of change (as measured over all the neurons in the 
network) of the length of each weight vector, and also the 
maximum rate of change of the absolute value of each 
bias, was always equal to the requested update step size; 
this prescription will adjust the parameter values until 
they jitter around in the neighbourhood of their optimum 
values. The optimum reference vectors could in princi- 
ple be completely determined using equation 2.4 for each 
choice of weights and biasses, but it is not necessary for 
the reference vectors to keep in precise synchrony with 
the weights and biasses. Rather, the reference vectors 
were controlled in a similar way to the weight vectors, 
except that they used three times the update step size, 
which made them more agile than the weights and biasses 
they were trying to follow. 
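
One way to read the step-size prescription above is sketched below. This is an interpretation rather than the author's code: it assumes that a single shared rate is rescaled at every update so that the largest change made to any one vector has Euclidean length equal to the requested step size. The function name `scaled_update` and the example numbers are hypothetical.

```python
import math

def scaled_update(vectors, gradients, step_size):
    """Move every vector down its gradient, with one shared rate chosen so that
    the largest single update has Euclidean length equal to step_size."""
    largest = max(math.sqrt(sum(g * g for g in grad)) for grad in gradients)
    if largest == 0.0:
        return [list(v) for v in vectors]  # every gradient is zero: nothing to do
    rate = step_size / largest
    return [[vi - rate * gi for vi, gi in zip(v, g)]
            for v, g in zip(vectors, gradients)]

weights = [[0.0, 0.0], [1.0, 1.0]]
grads = [[3.0, 4.0], [0.3, 0.4]]
new_weights = scaled_update(weights, grads, step_size=0.01)
```

Under the same reading, the reference vectors would be updated by calling `scaled_update` with three times the step size, which is what makes them more agile than the weights and biasses.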

Figure 9: 1-dimensional dominance stripes after training on
synthetic data.


The ocular dominance stripes that emerge from this
simulation are shown in figure 9. The ocularity for a
given neuron was estimated by computing the average of
the absolute deviations (as measured with respect to the
overall mean reference vector component value, which is
zero for the zero mean training data that is used here) of 
its reference vector components within its receptive field, 
both for the left retina and the right retina. This allows 
two plots to be drawn: average value of absolute devia- 
tions from the mean in left retina’s receptive field as a 
function of position across the network, and similarly the 
right retina’s receptive field. As can be seen in figure 9, 
these two curves are approximately periodic, and are in 
antiphase with each other; this corresponds to the situ- 
ation shown in figure 6. The amplitude of the ocularity 
curves is less than the 0.5 that would be required for the 
end points of the ocularity dimension to be reached, be- 
cause one of the effects of leakage is to introduce a type of 
elastic tension between the reference vectors that causes 
them to contract towards zero ocularity. Note how the 
ocular dominance curves have a period of approximately 
7, which is slightly greater than the output layer neigh- 
bourhood size (which is 5). In the limit of zero leakage 
and infinite receptive field size the period would be equal 
to the output layer neighbourhood size, in order to guar- 
antee that a complete set of processing machinery is con- 
tained within each output layer neighbourhood size; this 
effect is called completeness. 
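
The ocularity estimate described above can be sketched as follows. The index lists that say which reference vector components belong to which retina's receptive field, and the example numbers, are hypothetical; only the average-of-absolute-deviations rule is taken from the text.

```python
def ocularity(reference_vector, left_field, right_field, overall_mean=0.0):
    """Average absolute deviation of the reference vector components inside the
    left and right receptive fields (each given as a list of component indices)."""
    def mean_abs_dev(indices):
        return sum(abs(reference_vector[i] - overall_mean) for i in indices) / len(indices)
    return mean_abs_dev(left_field), mean_abs_dev(right_field)

# Hypothetical neuron: components 0-2 see the left retina, 3-5 the right retina.
ref = [0.3, -0.3, 0.3, 0.05, -0.05, 0.05]
left, right = ocularity(ref, left_field=[0, 1, 2], right_field=[3, 4, 5])
```

Plotting the two returned values against neuron position gives the pair of curves described above; a neuron such as this one, with a much larger left value than right value, sits in a left-dominated stripe.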

If the above simulation is continued for a further 2000 
updates with a reduced leakage, by reducing the stan- 
dard deviation of the Gaussian leakage profile from 1 to 
0.5, then the ocular dominance curves become more like 
square waves than sine waves, as shown in figure 10; this 
is similar to the type of situation that was shown in fig- 
ure 8, except that the input images are featureless in this 
case. 


2. Natural Training Data 


Figure 11 shows the Brodatz texture image [1] that
was used to generate a more realistic training set than
was used in the synthetic simulations described above.
Figure 12 shows an enlarged portion of figure 11, where
it is clear that the characteristic length scale of the
texture structure is in the range 5 to 10 pixels. This is
large enough compared to the receptive field size (9) and
the output layer neighbourhood size (5) that a simulation
using 1-dimensional training vectors extracted from
this 2-dimensional Brodatz image will effectively see very
low resolution training data, and should respond
approximately as described in figure 6.


Figure 10: 1-dimensional square wave dominance stripes after
further training with reduced probability leakage on synthetic
data.


Figure 11: Brodatz texture image used as a natural training
image.


The results corresponding to figure 9 and figure 10 are 
shown in figure 13 and figure 14, respectively. 


The general behaviour is much the same in the syn- 
thetic and Brodatz cases, except that the depth of the 
ocularity fluctuations is somewhat less in the real case, 
because in the Brodatz case the training data is not ac- 
tually featureless within each receptive field. 


Figure 12: Magnified portion of the Brodatz texture training
image.


Figure 13: 1-dimensional dominance stripes after training on
natural data.


Figure 14: 1-dimensional square wave dominance stripes after
further training with reduced probability leakage on natural
data.


Figure 15: 2-dimensional dominance stripes after training on
synthetic data.


B. Dominance Stripes: The 2-Dimensional Case 


This section extends the results of the previous section 
to the case of 2-dimensional neural networks. The train- 
ing schedule(s) used in the simulations have not been 
optimised. Usually the update rate is chosen conserva- 
tively (i.e. smaller than it needs to be) to avoid possible 
numerical instabilities, and the number of training up- 
dates is chosen to be larger than it needs to be to ensure 
that convergence has occurred. It is highly likely that 
much more efficient training schedules could be found. 


1. Synthetic Training Data 


The results that were presented in figure 9 may readily 
be extended to the case of a 2-dimensional network. The 
parameters used were: network size = 100 x 100, recep- 
tive field size = 3 x 3 (which is artificially small to allow 
the simulation to run faster), output layer neighbour- 
hood size = 5 x 5 (centred on the source neuron), leakage 
neighbourhood size = 3 x 3 (centred on the source neu- 
ron), number of training updates = 24000 (dominance 
stripes develop quickly, so far fewer than 24000 training 
updates could be used), update step size = 0.001. For 
each neuron the leakage probability had a Gaussian pro- 
file centred on the neuron, and the standard deviations 
were chosen as 1 x 1, to make the profile fall from 1 on
the source neuron to $\exp(-\frac{1}{2})$ on each of its four closest
neighbours.

Apart from the different parameter values, the simula- 
tion was conducted in precisely the same way as in the 
1-dimensional case, and the results for ocular dominance 
are shown in figure 15, where ocularity has been quan- 
tised as a binary-valued quantity. These results show 
the characteristic striped structure that is familiar from 
experiments on the mammalian visual cortex. The
behaviour near to the boundary depends critically on the
interplay between the receptive field size(s) and the
output layer neighbourhood size(s).


Figure 16: 2-dimensional dominance stripes after training on
natural data.


2. Natural Training Data 


The simulation, whose results were shown in figure 15, 
may be repeated using the Brodatz image training set 
shown in figure 11, to yield the results shown in figure 
16. These results are not quite as stripe-like as the results 
in figure 15, because in the Brodatz case the training data 
is not actually featureless within each receptive field. 


C. Orientation Maps 


The purpose of the simulations that are presented in 
this section is to demonstrate the emergence of orienta- 
tion maps in the simplest possible realistic case. In the 
case of two retinae, the results will correspond to the 
situation outlined in figure 7 (or, at least, a higher di- 
mensional version of that figure). 


1. Orientation Map (One Retina) 


In this simulation the parameters used were: network 
size = 30 x 30, receptive field size = 17 x 17, output 
layer neighbourhood size = 9 x 9 (centred on the source 
neuron), leakage neighbourhood size = 3 x 3 (centred on 
the source neuron), number of training updates = 24000, 
update step size = 0.01. For each neuron the leakage
probability had a Gaussian profile centred on the neuron,
and the standard deviations were chosen as 1 x 1, to make
the profile fall from 1 on the source neuron to $\exp(-\frac{1}{2})$ on
each of its four closest neighbours.


Figure 17: Orientation map after training on natural data.

Note that both the receptive field size and the output 
layer neighbourhood size are substantially larger than in 
the 2-dimensional dominance stripe simulations, because 
many more neurons are required in order to allow orien- 
tation maps to develop than to allow dominance stripes 
to develop; in fact it would be preferable to use even 
larger sizes than were used here. To limit the computer 
run time this meant that the overall size of the neural 
network had to be reduced from 100 x 100 to 30 x 30. 
The training set was the Brodatz texture image in figure 
11. 

The results are shown in figure 17 where the receptive 
fields have been gathered together in a montage. There 
is a clear swirl-like pattern that is characteristic of orien- 
tation maps. Each local clockwise or anticlockwise swirl 
typically circulates around an unoriented region. 


2. Using the Orientation Map 


In figure 18 the orientation map network shown in fig- 
ure 17 is used to encode and decode a typical input image. 
On the left of figure 18 the input image (i.e. x) is shown, 
in the centre of figure 18 the corresponding output (i.e. 
its PMD posterior probability Pr (y|x)) produced by the 
orientation map is shown, and on the right of figure 18 the 
corresponding reconstruction (i.e. $\sum_{y=1}^{M} \Pr(y|x)\, x'(y)$)
is shown. 
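
The decoding step can be sketched in a few lines, assuming the reconstruction is the posterior-weighted sum of the reference vectors, $\sum_y \Pr(y|x)\, x'(y)$. The function name `reconstruct` and the toy numbers are illustrative assumptions.

```python
def reconstruct(posterior, reference_vectors):
    """Posterior-weighted sum of reference vectors: sum_y Pr(y|x) x'(y)."""
    dim = len(reference_vectors[0])
    return [sum(p * ref[i] for p, ref in zip(posterior, reference_vectors))
            for i in range(dim)]

# Toy example: three neurons, two input components. The third neuron has zero
# posterior probability, so its reference vector does not contribute.
posterior = [0.5, 0.5, 0.0]
refs = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
x_hat = reconstruct(posterior, refs)
```

Because only neurons with non-zero posterior probability contribute, a sparse output in which a few activity bubbles carry all of the probability yields a reconstruction built from only a few reference vectors.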

Figure 18: Typical input, output and reconstruction produced
by the orientation map.


The output consists of a number of isolated “activity
bubbles” of posterior probability, and the reconstruction
is a low resolution version of the original input. The form
of output is familiar as a type of “sparse coding” of the
input, where only a small fraction of the neurons partic- 
ipate in encoding a given input (this type of transforma- 
tion of the input is central to the work that was reported 
in [14]). This type of encoding is very convenient be- 
cause it has effectively transformed the input into a small 
number of constituents each of which corresponds to an 
activity bubble, rather than transforming the input into 
a representation where the output activity is spread over 
all of the neurons, which is thus not easily interpretable 
as arising from a small number of constituents. 

The reconstruction has a lower resolution than the in- 
put because there are insufficient neurons to faithfully 
record all the information that is required to reconstruct 
the input exactly (e.g. probability leakage causes neigh- 
bouring neurons to have a correlated response, thus re- 
ducing the effective number of neurons that are avail- 
able). The featureless region around the edge of the re- 
construction is an artefact, which occurs because fewer 
neurons (per unit area) contribute to the reconstruction 
near the edge of the input array. 


3. Orientation Map (Two Retinae)


The above orientation map results may be generalised 
to the case of two retinae. The parameter values used 
were the same, apart from the standard deviation of the 
leakage Gaussian which was reduced to 0.5 x 0.5 in or- 
der to allow more detailed structure to develop in the 
adaptive parameter values of the output neurons. This 
is necessary because the presence of two retinae causes 
dominance stripes to develop, which allows only half of 
the neurons to be allocated to each retina, so a complete 
repertoire of computing machinery must be forced into 
half the number of neurons that were used in the case of 
one retina. 

The results are shown in figure 19 where the receptive 
fields for the left and right retinae have been used to cre- 
ate a colour separation in which one retina is coded as 
blue and the other as yellow. Within each retina there
is a long-scale periodic fluctuation in overall brightness
which corresponds to the dominance stripes. Within each
dominance stripe there is the characteristic swirl-like
pattern of the orientation map. Note that the unoriented
regions typically occur at the centre of dominance stripes,
as observed in the visual cortex; this can be understood
intuitively by referring to figure 8.


1 In this black and white version of the paper the blue/yellow
channel is displayed on the left/right respectively.

A larger simulation would be required in order to ac- 
curately estimate the detailed orientation map as a vec- 
tor flow field. Such simulations could be used to verify 
whether the iso-orientation contours typically lie perpen- 
dicular to the dominance stripe boundaries, as observed 
in the visual cortex. The dominance stripe structure that 
appears in this simulation is not as distinct as the stripes 
in figure 16. This is not a fundamental problem, but 
rather it is a result of the limited size of computer simu- 
lation that could be run in a reasonable length of time. It 
should also be noted that the dominance stripes that are 
observed in the visual cortex are sometimes more blob- 
like than stripe-like [13], so it is pleasing that different 
choices of parameter value should yield a variety of de- 
grees of stripiness in our simulations. 


V. CONCLUSIONS 


This paper has shown how folded Markov chains 
(FMCs) [6] can be combined with partitioned mixture 
distributions (PMDs) [7] to yield a class of self-organising 
neural networks that has many of the properties that are 
observed in the mammalian visual cortex [2, 13], which 
are thus called visual cortex networks (VICON). These 
neural networks differ from previous models of the vi- 
sual cortex, insofar as they model the neuron behaviour 
in terms of their individual firing events, and operate 
in the real space of input images rather than a hand- 
crafted abstract space, and the use of Bayesian methods 
makes the nature of the network’s computations clearer 
than in the case where the network behaviour is sim- 
ply postulated. When the neural network structure (e.g. 
receptive field size) parameters are appropriately chosen, 
dominance stripes and orientation maps emerge naturally 
when the network is trained on a natural image (e.g. a 
Brodatz texture image). 

These results show how this type of network is capa- 
ble of self-organising its internal parameters in familiar 
ways when trained on data from multiple sources (actu- 
ally, only two sources in the case of the visual cortex-like 
network). The same network objective function could be 
used when an arbitrary number of data sources is pre- 
sented, and it is anticipated that it would lead to analo- 
gous results. 

An extension of the network objective function to the 
case where sets of multiple neural firing events are con- 
sidered has been published [8, 12], and an extension to 
the case of a multilayer network has been published [11]. 
When combined, these extensions could be applied to the
problem of the processing of data from multiple sensors
(i.e. data fusion).


Figure 19: Orientation map and dominance stripes after training on natural data.


Appendix A: Bayesian PMD 


In this section a fully Bayesian interpretation of a par- 
titioned mixture distribution (PMD) will be presented. 

Consider the general problem of computing a posterior 
probability Pr (y|x) over classes y given an input vector 
x. If there is more than one model k then Pr (y|x) is
given by a marginal PDF

$$
\Pr(y|x) = \sum_{k} \Pr(y, k|x) \qquad (A1)
$$

where Pr (y, k|x) is the joint PDF of class y and model k
given an input vector x. Bayes’ theorem may be used to
rewrite this as follows

$$
\begin{aligned}
\Pr(y, k|x) &= \frac{\Pr(y, k, x)}{\Pr(x)} \\
&= \frac{\Pr(y|k, x) \Pr(k, x)}{\Pr(x)} \\
&= \Pr(y|k, x) \Pr(k)
\end{aligned}
\qquad (A2)
$$

where Pr (k, x) = Pr (k) Pr (x) (i.e. independence of
model k and data vector x) has been assumed in the
last step. Thus the posterior probability Pr (y|x) may be
written as

$$
\Pr(y|x) = \sum_{k} \Pr(y|k, x) \Pr(k) \qquad (A3)
$$

Assume that there are M models, and that the prior
probabilities Pr (k) of the various models are equal, so
that $\Pr(k) = \frac{1}{M}$ in which case the posterior probability
reduces to

$$
\Pr(y|x) = \frac{1}{M} \sum_{k=1}^{M} \Pr(y|k, x) \qquad (A4)
$$
which is an average of M contributing posterior proba- 
bilities (one from each of the contributing models). The 
PMD posterior probability in equation 2.8 is a special 
case of this result. 
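As a minimal numerical sketch of equations A3 and A4, the following averages the class posteriors produced by several models. The function name `averaged_posterior` and the example numbers are illustrative assumptions, not part of the original text; when no prior is supplied, the equal prior Pr(k) = 1/M of equation A4 is used.

```python
import numpy as np

def averaged_posterior(per_model_posteriors, prior=None):
    """Return Pr(y|x) = sum_k Pr(y|k,x) Pr(k) (equation A3).

    per_model_posteriors: array of shape (M, n_classes); row k holds Pr(y|k,x).
    prior: optional array of shape (M,) holding Pr(k); defaults to 1/M (equation A4).
    """
    p = np.asarray(per_model_posteriors, dtype=float)
    m = p.shape[0]
    if prior is None:
        prior = np.full(m, 1.0 / m)  # equal model priors
    prior = np.asarray(prior, dtype=float)
    return prior @ p  # weighted sum over the model index k

# Hypothetical posteriors from M = 3 models over 3 classes, for a single input x.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.5, 0.4, 0.1],
                       [0.6, 0.3, 0.1]])
combined = averaged_posterior(posteriors)
weighted = averaged_posterior(posteriors, prior=[0.5, 0.25, 0.25])
```

Because each row is normalised and the prior sums to one, the combined posterior is automatically normalised.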

More generally, the prior probabilities Pr(k) are k- 
dependent, and might be chosen in some optimal fash- 
ion to best handle the training set. The simplest way 
of determining an optimal Pr (k) is to minimise D with 
respect to Pr (k); this merely extends the space in which 
D is optimised to include more of the parameters inside 
Pr (y|x). 


Appendix B: Optimal Solutions 


In this section the objective function D will be minimised in the case where the input space consists of one or more subspaces, within each of which all of the input vector components have the same value. In the language of imaging sensors, these special cases correspond to each sensor viewing a featureless scene (i.e. all pixels having the same brightness value), which is effectively the lowest order term in a Taylor expansion of the spatial variation of pixel brightness values. This might not appear to be an interesting scenario to consider, but it leads to a highly non-trivial optimal network behaviour when D is minimised. More complicated input statistics lead to even more complicated optimal network behaviour, so only the simplest case described above will be considered at first.


1. One Input Subspace 


This may be used to optimise the network for a single sensor viewing a featureless scene. For a d-dimensional input space Pr(x) is thus given by

$$\Pr(\mathbf{x}) = \Pr(x_1)\prod_{i=2}^{d}\delta(x_i - x_1) \tag{B1}$$

whence the objective function D in equation 2.3 reduces to

$$D = 2d\int dx_1\,\Pr(x_1)\sum_{y=1}^{M}\Pr(y|x_1)\,\left(x_1 - x_1'(y)\right)^2 \tag{B2}$$

This is d times the objective function for a 1-dimensional soft scalar quantiser which encodes inputs in $x_1$-space whose PDF is $\Pr(x_1)$.


2. Two Input Subspaces 


This may be used to optimise the network for a pair of sensors each of which views a featureless scene, and which are possibly correlated with each other. The one input subspace case above can readily be generalised to more input subspaces. Let the d-dimensional input space be split into two $\frac{d}{2}$-dimensional subspaces, where Pr(x) is given by

$$\Pr(\mathbf{x}) = \Pr(x_1,x_2)\prod_{i=2}^{d/2}\delta(x_{2i-1} - x_1)\,\delta(x_{2i} - x_2) \tag{B3}$$

where one of the subspaces consists of the odd-numbered components, and the other the even-numbered components of the input vector (this particular ordering of the components is not important). Whence the objective function D in equation 2.3 reduces to

$$D = d\int dx_1\,dx_2\,\Pr(x_1,x_2)\sum_{y=1}^{M}\Pr(y|x_1,x_2)\left(\left(x_1 - x_1'(y)\right)^2 + \left(x_2 - x_2'(y)\right)^2\right) \tag{B4}$$

This is $\frac{d}{2}$ times the objective function for a 2-dimensional soft vector quantiser which encodes inputs in $(x_1,x_2)$-space whose PDF is $\Pr(x_1,x_2)$. This result generalises in the obvious way to a larger number of input subspaces.


3. PMD Posterior Probability


In the above special cases each neuron potentially responds to all of the components of the input vector. If this were to be built in hardware, then each neuron would have a number of inputs equal to the dimensionality of the input space, which becomes unwieldy if the input space has a high dimensionality (e.g. an image). For high-dimensional inputs it is sensible to limit the number of inputs to each neuron, which can readily be implemented by imposing a finite-sized receptive field on the input of each neuron, such that it can respond only to a limited subset of all of the input vector components. This constraint will prevent the ideal vector quantiser solutions from being obtained, so the purpose of this section is to derive the constrained optimal solution. Note that this type of input is a special case of the type of solution that would be obtained by adding a "wire-length" penalty term to the objective function in order to penalise the connection of a neuron to too many input components.

Even if receptive fields are used to restrict the length of the input connections, the posterior probability Pr(y|x) effectively needs long-range lateral connections between the output neurons in order to implement the normalisation condition $\sum_{y=1}^{M}\Pr(y|\mathbf{x}) = 1$. The simplest example of this is the standard vector quantiser, whose winner-take-all property requires that all neurons are laterally connected to all other neurons even if each of them has only a finite-sized receptive field. A partitioned mixture distribution (PMD) posterior probability, in which the posterior probability Pr(y|x) is only locally connected, can be used to ensure that all the connections in the network are local (see section IIC).


a. Receptive Fields 


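Write the input vector as $\mathbf{x} = (\tilde{\mathbf{x}}(y), \bar{\mathbf{x}}(y))$ where $\tilde{\mathbf{x}}(y)$ is the part of x that lies within the receptive field of neuron y, and, for simplicity, assume that the receptive field used for $\tilde{\mathbf{x}}'(y)$ is chosen to be the same as that for $\tilde{\mathbf{x}}(y)$, and that all receptive fields see the same number w of input components. Because the input vector is split into two subspaces as $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$, its decomposition as $(\tilde{\mathbf{x}}(y), \bar{\mathbf{x}}(y))$ may similarly be split into two subspaces as $\tilde{\mathbf{x}}(y) = (\tilde{\mathbf{x}}_1(y), \tilde{\mathbf{x}}_2(y))$ and $\bar{\mathbf{x}}(y) = (\bar{\mathbf{x}}_1(y), \bar{\mathbf{x}}_2(y))$. Use the orthogonality of $\tilde{\mathbf{x}}(y)$ and $\bar{\mathbf{x}}(y)$ to write (for i = 1,2)

$$\left\|\bar{\mathbf{x}}_i(y) + \tilde{\mathbf{x}}_i(y) - \tilde{\mathbf{x}}_i'(y)\right\|^2 = \left\|\bar{\mathbf{x}}_i(y)\right\|^2 + \left\|\tilde{\mathbf{x}}_i(y) - \tilde{\mathbf{x}}_i'(y)\right\|^2$$

and simplify D in equation 2.3 thus

$$D = 2\int d\mathbf{x}_1\,d\mathbf{x}_2\,\Pr(\mathbf{x}_1,\mathbf{x}_2)\sum_{y=1}^{M}\Pr(y|\mathbf{x}_1,\mathbf{x}_2)\left(\left\|\bar{\mathbf{x}}_1(y)\right\|^2 + \left\|\bar{\mathbf{x}}_2(y)\right\|^2 + \left\|\tilde{\mathbf{x}}_1(y) - \tilde{\mathbf{x}}_1'(y)\right\|^2 + \left\|\tilde{\mathbf{x}}_2(y) - \tilde{\mathbf{x}}_2'(y)\right\|^2\right) \tag{B5}$$

There are two terms to consider.


1. $\|\bar{\mathbf{x}}_1(y)\|^2 + \|\bar{\mathbf{x}}_2(y)\|^2$: This is the contribution from outside the $y^{th}$ receptive field, which is the $L_2$ norm of those components of the input vector that lie outside the $y^{th}$ receptive field.


2. $\|\tilde{\mathbf{x}}_1(y) - \tilde{\mathbf{x}}_1'(y)\|^2 + \|\tilde{\mathbf{x}}_2(y) - \tilde{\mathbf{x}}_2'(y)\|^2$: This is the contribution from inside the $y^{th}$ receptive field, which is the $L_2$ norm of those components of the error vector (i.e. input minus reconstruction) that lie inside the $y^{th}$ receptive field.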


b. Simplify the $\|\bar{\mathbf{x}}_1(y)\|^2 + \|\bar{\mathbf{x}}_2(y)\|^2$ Term


$\|\bar{\mathbf{x}}_1(y)\|^2 + \|\bar{\mathbf{x}}_2(y)\|^2$ is the $L_2$ norm of those components of the input vector that lie outside the $y^{th}$ receptive field, which is known once the input vector is specified. Furthermore, because of the assumed input PDF (i.e. all input components in each subspace have the same value), together with the assumed receptive field prescription (i.e. all receptive fields are the same size w), this $L_2$ norm is independent of y given that x is known, so this term has the following contribution to D

$$D = (d-w)\left(\int dx_1\,\Pr(x_1)\,x_1^2 + \int dx_2\,\Pr(x_2)\,x_2^2\right) \tag{B6}$$

where d − w is the number of input components that lie outside each receptive field.


c. Simplify the $\|\tilde{\mathbf{x}}_1(y) - \tilde{\mathbf{x}}_1'(y)\|^2 + \|\tilde{\mathbf{x}}_2(y) - \tilde{\mathbf{x}}_2'(y)\|^2$ Term


Assume that $\Pr(y|\mathbf{x}_1,\mathbf{x}_2)$ has the PMD form of a sum over mixture distribution posterior probabilities (as described in section IIC), so that

$$\begin{aligned}
\Pr(y|\mathbf{x}_1,\mathbf{x}_2) &= \frac{1}{M}\sum_{y'\in\mathcal{N}^{-1}(y)}\Pr(y|\mathbf{x}_1,\mathbf{x}_2;y') \\
&= \frac{1}{M}\,Q(\mathbf{x}_1,\mathbf{x}_2|y)\sum_{y'\in\mathcal{N}^{-1}(y)}\frac{1}{\sum_{y''\in\mathcal{N}(y')}Q(\mathbf{x}_1,\mathbf{x}_2|y'')}
\end{aligned} \tag{B7}$$

The overall receptive field that affects the value of $\Pr(y|\mathbf{x}_1,\mathbf{x}_2)$ (for a given y) may be read off this expression. Thus $\mathbf{x}(y'')$ comprises those components of x that lie within the receptive field of neuron $y''$, and the $\sum_{y'\in\mathcal{N}^{-1}(y)}\frac{1}{\sum_{y''\in\mathcal{N}(y')}(\cdots)}$ operation compounds these $\mathbf{x}(y'')$ so that the overall set of components of x that are needed for the purposes of calculating $\Pr(y|\mathbf{x}_1,\mathbf{x}_2)$ is given by (using a somewhat cavalier notation)

$$\tilde{\mathbf{x}}(y) = \bigcup_{y'\in\mathcal{N}^{-1}(y)}\;\bigcup_{y''\in\mathcal{N}(y')}\mathbf{x}(y'') \tag{B8}$$

The individual $\Pr(y|\mathbf{x}_1,\mathbf{x}_2;y')$ that contribute to $\Pr(y|\mathbf{x}_1,\mathbf{x}_2)$ each depend on a smaller set of components of x than the full $\Pr(y|\mathbf{x}_1,\mathbf{x}_2)$, because there is one less summation over a y variable. However, it is convenient, and imposes no constraint, to use the full set of components thus

$$\Pr(y|\mathbf{x}_1,\mathbf{x}_2;y') = \Pr(y|\tilde{\mathbf{x}}_1(y),\tilde{\mathbf{x}}_2(y);y') \tag{B9}$$

The y and y' summations can be interchanged using $\sum_{y=1}^{M}\sum_{y'\in\mathcal{N}^{-1}(y)}(\cdots) = \sum_{y'=1}^{M}\sum_{y\in\mathcal{N}(y')}(\cdots)$, whence the contribution to D is

$$D = \frac{2}{M}\int d\mathbf{x}_1\,d\mathbf{x}_2\,\Pr(\mathbf{x}_1,\mathbf{x}_2)\sum_{y'=1}^{M}\sum_{y\in\mathcal{N}(y')}\Pr(y|\tilde{\mathbf{x}}_1(y),\tilde{\mathbf{x}}_2(y);y')\left(\left\|\tilde{\mathbf{x}}_1(y) - \tilde{\mathbf{x}}_1'(y)\right\|^2 + \left\|\tilde{\mathbf{x}}_2(y) - \tilde{\mathbf{x}}_2'(y)\right\|^2\right) \tag{B10}$$


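Because the components of $\tilde{\mathbf{x}}_i(y)$ are a subset of the components of $\mathbf{x}_i$ (for i = 1,2), $\Pr(\mathbf{x}_1,\mathbf{x}_2)$ can be marginalised to yield

$$D = \frac{2}{M}\sum_{y'=1}^{M}\sum_{y\in\mathcal{N}(y')}\int d\tilde{\mathbf{x}}_1(y)\,d\tilde{\mathbf{x}}_2(y)\,\Pr(\tilde{\mathbf{x}}_1(y),\tilde{\mathbf{x}}_2(y))\,\Pr(y|\tilde{\mathbf{x}}_1(y),\tilde{\mathbf{x}}_2(y);y')\left(\left\|\tilde{\mathbf{x}}_1(y) - \tilde{\mathbf{x}}_1'(y)\right\|^2 + \left\|\tilde{\mathbf{x}}_2(y) - \tilde{\mathbf{x}}_2'(y)\right\|^2\right) \tag{B11}$$

Because $\Pr(\mathbf{x}_1,\mathbf{x}_2)$ specifies that all of the components in each subspace are the same, this contribution to D may be simplified to

$$D = \frac{w}{M}\sum_{y'=1}^{M}\sum_{y\in\mathcal{N}(y')}\int dx_1\,dx_2\,\Pr(x_1,x_2)\,\Pr(y|x_1,x_2;y')\left(\left(x_1 - x_1'(y)\right)^2 + \left(x_2 - x_2'(y)\right)^2\right) \tag{B12}$$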


d. Periodic Optimal Solutions 


Combining the results from outside (equation B6) and inside (equation B12) the receptive fields yields finally

$$\begin{aligned}
D ={}& (d-w)\left(\int dx_1\,\Pr(x_1)\,x_1^2 + \int dx_2\,\Pr(x_2)\,x_2^2\right) \\
&+ \frac{w}{M}\sum_{y'=1}^{M}\sum_{y\in\mathcal{N}(y')}\int dx_1\,dx_2\,\Pr(x_1,x_2)\,\Pr(y|x_1,x_2;y')\left(\left(x_1 - x_1'(y)\right)^2 + \left(x_2 - x_2'(y)\right)^2\right)
\end{aligned} \tag{B13}$$

The first of these terms is constant, so it may be ignored insofar as network optimisation is concerned. The second term is much more interesting. It is the sum of the objective functions of a large number of 2-dimensional soft vector quantisers. However, these objective functions cannot be optimised independently of each other, because the posterior probabilities $\Pr(y|x_1,x_2;y')$ force the neurons to share parameters with each other.

Drop the constant term, and interchange the order of summation to obtain

$$D = w\sum_{y=1}^{M}\int dx_1\,dx_2\,\Pr(x_1,x_2)\,\Pr(y|x_1,x_2)\left(\left(x_1 - x_1'(y)\right)^2 + \left(x_2 - x_2'(y)\right)^2\right) \tag{B14}$$

where $\Pr(y|x_1,x_2)$ is the PMD posterior probability given by

$$\Pr(y|x_1,x_2) = \frac{1}{M}\sum_{y'\in\mathcal{N}^{-1}(y)}\Pr(y|x_1,x_2;y') \tag{B15}$$

Now suppose that $\Pr(y|x_1,x_2)$ and $x_i'(y)$ have the periodicity property

$$\begin{aligned}
\Pr(y+m|x_1,x_2) &= \Pr(y|x_1,x_2) \\
x_i'(y+m) &= x_i'(y)
\end{aligned} \tag{B16}$$

where the fact that y is restricted to $1 \le y \le M$ has been ignored for simplicity, then D can be simplified thus (again, ignoring the fact that y is restricted to $1 \le y \le M$)

$$\begin{aligned}
D &= w\sum_{y_0=0}^{\frac{M}{m}-1}\;\sum_{y=m y_0+1}^{m(y_0+1)}\int dx_1\,dx_2\,\Pr(x_1,x_2)\,\Pr(y|x_1,x_2)\left(\left(x_1 - x_1'(y)\right)^2 + \left(x_2 - x_2'(y)\right)^2\right) \\
&= w\sum_{y=1}^{m}\int dx_1\,dx_2\,\Pr(x_1,x_2)\,\frac{M}{m}\Pr(y|x_1,x_2)\left(\left(x_1 - x_1'(y)\right)^2 + \left(x_2 - x_2'(y)\right)^2\right)
\end{aligned} \tag{B17}$$

where $\frac{M}{m}\sum_{y=1}^{m}\Pr(y|x_1,x_2) = 1$ follows from $\sum_{y=1}^{M}\Pr(y|x_1,x_2) = 1$ and the periodicity property, so $\frac{M}{m}\Pr(y|x_1,x_2)$ serves as a posterior probability for $1 \le y \le m$.


This demonstrates that if the optimal solution is peri- 
odic, with period m, then the objective function is pro- 
portional to the objective function for a 2-dimensional 
soft vector quantiser with m neurons. Note that thus 
far nothing has been said about the actual value of m; 
its optimal value depends on the interplay between the 
receptive field size(s), the output layer neighbourhood 
size(s), and the leakage neighbourhood size(s). Because 
this type of periodic solution is essentially a set of over- 
lapping m neuron soft vector quantisers, each set of m 
neurons will typically exhibit the properties of such quan- 
tisers. In particular this means that each set of m neu- 

The problem of optimising a network of discretely firing neurons is addressed. An objective function is introduced which measures the average number of bits that are needed for the network to encode its state. When this is minimised, it is shown that this leads to a number of results, such as topographic mappings, piecewise linear dependence on the input of the probability of a neuron firing, and factorial encoder networks.


I. INTRODUCTION 


In this paper the problem of optimising the firing char- 
acteristics of a network of discretely firing neurons will 
be considered. The approach adopted will not be based 
on any particular model of how real neurons operate, 
but will focus on theoretically analysing some of the in- 
formation processing capabilities of a layered network of 
units (which happen to be called neurons). Ideal net- 
work behaviour is derived by choosing the ideal neural 
properties that minimise an information theoretic objec- 
tive function which specifies the number of bits required 
by the network to encode the state of its layers. This is 
done in preference to assuming a highly specific neural 
behaviour at the outset, followed by optimisation of a few 
remaining parameters such as weight and bias values. 

Why use an objective function in the first place? An 
objective function is a very convenient starting point (a 
set of “axioms”, as it were), from which everything else 
can, in principle, be derived (as “theorems”, as it were). 
An objective function has the same status as a model, 
which may be falsified should some counterevidence be 
discovered. The objective function used in this paper is 
the simplest that is consistent with predicting a number 
of non-trivial results, such as topographic mappings, and 
factorial encoders (which are discussed in this paper). 
However, it does not include any temporal information, 
nor any biological plausibility constraints (other than the 
fact that the network is assumed to be layered). More 
complicated objective functions will be the subject of fu- 
ture publications. 

In section II an objective function is introduced, and 
its connection with discretely firing neural networks is de- 
rived. In section III some examples are presented which 
show how this theory of discretely firing neural networks
leads to some non-trivial results.


II. THEORY


In this section a theory of discretely firing neural net- 
works is developed. Section IIA introduces the objec- 
tive function for optimising an encoder, and section IIB 
shows how this can be applied to the problem of optimis- 
ing a discretely firing neural network. 


A. Objective Function for Optimal Coding 


The inspiration for the approach that is used here is 
the minimum description length (MDL) method [5]. In 
this paper, a training set vector (which is unlabelled) will 
be denoted as x, a vector of statistics which are stochas- 
tically derived from x will be denoted as y, and their 
joint probability density function (PDF) will be denoted 
as Pr(x,y). The problem is to learn the functional form 
of Pr(x,y), so that vectors (x,y) sampled from Pr(x, y) 
can be encoded using the minimum number of bits on 
average. It is unconventional to consider the problem of 
encoding (x,y), rather than x alone, but it turns out 
that this leads to many useful results. 

Thus Pr(x,y) is approximated by a learnt model Q(x,y), in which case the average number of bits required to encode an (x,y) sampled from the PDF Pr(x,y) is given by the objective function D, which is defined as

$$D \equiv -\int d\mathbf{x}\sum_{\mathbf{y}}\Pr(\mathbf{x},\mathbf{y})\log Q(\mathbf{x},\mathbf{y}) \tag{1}$$

Now split D into two contributions by using Pr(x,y) = Pr(x) Pr(y|x) and Q(x,y) = Q(x|y) Q(y).

$$D = -\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{\mathbf{y}}\Pr(\mathbf{y}|\mathbf{x})\log Q(\mathbf{x}|\mathbf{y}) - \sum_{\mathbf{y}}\Pr(\mathbf{y})\log Q(\mathbf{y}) \tag{2}$$

pins several subsequently published papers. ©1998 British Crown Copyright/DERA
The first term is the cost (i.e. the average number of bits), averaged over all possible values of y, of encoding an x sampled from Pr(x|y) using the model Q(x|y). This interpretation uses the identity Pr(x) Pr(y|x) = Pr(y) Pr(x|y). The second term is the cost of encoding a y sampled from Pr(y) using the model Q(y). Together these two terms correspond to encoding y (the second term), then encoding x given that y is known.

The model Q(x, y) may be optimised so that it min- 
imises D, and thus leads to the minimum cost of encoding 
(x,y) sampled from Pr(x, y). Ideally Q(x, y) = Pr(x, y), 
but in practice this is not possible because insufficient in- 
formation is available to determine Pr(x, y) exactly (i.e. 
the training set does not contain an infinite number of 
(x,y) vectors). It is therefore necessary to introduce a 
parametric model Q(x, y), and to choose the values of 
the parameters so that D is minimised. If the number of 
parameters is small enough, and the training set is large 
enough, then the parameter values can be accurately de- 
termined. 

A further simplification may be made if y can occupy far fewer states than x (given y) can, because then the cost of encoding y is much less than the cost of encoding x (given y) (i.e. the second and first terms in equation 2, respectively). In this case, it is a good approximation to retain only the first term in equation 2. This approximation becomes exact if Q(y) assigns equal probability to all states y, because then the second term is a constant. The reason for defining the objective function D as in equation 1, rather than defining it to be the first term of equation 2, is that equation 1 may be readily generalised to more complex systems, such as (x,y,z) in which Pr(x,y,z) = Pr(x) Pr(y|x) Pr(z|y), and so on. An example of this is given in section IIIA.

It is possible to relate the minimisation of D to the maximisation of the mutual information I between x and y. If the cost of encoding an x sampled from Pr(x) using the model Q(x) (i.e. $-\int d\mathbf{x}\,\Pr(\mathbf{x})\log Q(\mathbf{x})$) and the cost of encoding a y sampled from Pr(y) using the model Q(y) (i.e. $-\sum_{\mathbf{y}}\Pr(\mathbf{y})\log Q(\mathbf{y})$) are both subtracted from D, then the result is $-\int d\mathbf{x}\sum_{\mathbf{y}}\Pr(\mathbf{x},\mathbf{y})\log\left(\frac{Q(\mathbf{x},\mathbf{y})}{Q(\mathbf{x})\,Q(\mathbf{y})}\right)$. When Q(x,y) → Pr(x,y) this reduces to (minus) the mutual information I between x and y. Thus, if the cost of encoding the correlations between x and y is much greater than the cost of separately encoding x and y (i.e. the log(Q(x) Q(y)) term can be ignored in I), then D-minimisation approximates I-maximisation, which is another commonly used objective function.
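The two decompositions above can be checked numerically on a small discrete example. The sketch below is illustrative only: it assumes that x and y each take a handful of discrete values (so the integral over x becomes a sum), uses base-2 logarithms so that costs are in bits, and draws arbitrary random tables for Pr(x,y) and Q(x,y). The names `coding_cost`, `p_xy` and `q_xy` are hypothetical.

```python
import numpy as np

def coding_cost(p_xy, q_xy):
    """D = -sum_{x,y} Pr(x,y) log2 Q(x,y), the discrete analogue of equation 1."""
    return -np.sum(p_xy * np.log2(q_xy))

rng = np.random.default_rng(0)
p_xy = rng.random((4, 3)); p_xy /= p_xy.sum()   # true joint Pr(x,y)
q_xy = rng.random((4, 3)); q_xy /= q_xy.sum()   # model joint Q(x,y)

d_total = coding_cost(p_xy, q_xy)

# Split as in equation 2, using Q(x,y) = Q(x|y) Q(y).
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
q_y = q_xy.sum(axis=0)
q_x_given_y = q_xy / q_y                        # column y holds Q(x|y)
first_term = -np.sum(p_xy * np.log2(q_x_given_y))
second_term = -np.sum(p_y * np.log2(q_y))

# With a perfect model (Q = Pr), subtracting the separate costs of x and y
# from D leaves minus the mutual information.
d_exact = coding_cost(p_xy, p_xy)
cost_x = -np.sum(p_x * np.log2(p_x))
cost_y = -np.sum(p_y * np.log2(p_y))
mutual_information = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))
```

Here `first_term + second_term` reproduces `d_total`, and `d_exact - cost_x - cost_y` equals `-mutual_information`.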


B. Application to Neural Networks 


In order to apply the above coding theory results to 
a 2-layer discretely firing neural network, it is necessary 
to interpret x as a pattern of activity in the input layer, 
and y as the vector of locations in the output layer of 
a finite number of firing events. The objective function 


D is then the cost of using the model Q(x,y) of the 
network behaviour to encode the state (x, y) of the neural 
network (i.e. the input pattern and the location of the 
firing events), which is sampled from the Pr(x,y) that 
describes the true network behaviour. For instance, a 
second neural network can be used solely for computing 
the model Q(x, y), which is then used to encode the state 
(x,y) of the above first neural network. Note that no 
temporal information is included in this analysis, so the 
input and output of the network is a static (x, y) vector 
containing no time variables. 

These two neural networks can be combined into a sin- 
gle hybrid network, in which the machinery for comput- 
ing the model Q(x, y) is interleaved with the neural net- 
work, whose true behavior is described by Pr(x, y). The 
notation of equation 2 can now be expressed in more neu- 
ral terms, where Pr(y|x) is then a recognition model (i.e. 
bottom-up) and Q(x|y) is then a generative model (i.e. 
top-down), both of which live inside the same neural net- 
work. This is an unsupervised neural network, because it 
is trained with examples of only x-vectors, and the net- 
work uses its Pr(y|x) to stochastically generate a y from 
each x. 


Now introduce a Gaussian parametric model Q(x|y)

$$Q(\mathbf{x}|\mathbf{y}) = \frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^{\dim\mathbf{x}}}\exp\left(-\frac{\left\|\mathbf{x} - \mathbf{x}'(\mathbf{y})\right\|^2}{2\sigma^2}\right) \tag{3}$$

where x'(y) is the centroid of the Gaussian (given y), and σ is the standard deviation of the Gaussian. Also define a soft vector quantiser (VQ) objective function $D_{VQ}$ as

$$D_{VQ} \equiv 2\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{\mathbf{y}}\Pr(\mathbf{y}|\mathbf{x})\left\|\mathbf{x} - \mathbf{x}'(\mathbf{y})\right\|^2 \tag{4}$$

which is (twice) the average Euclidean reconstruction error that results when x is probabilistically encoded as y and then deterministically reconstructed as x'(y). These definitions of Q(x|y) and $D_{VQ}$ allow D to be written as

$$D = \frac{1}{4\sigma^2}D_{VQ} + \log\left(\sqrt{2\pi}\,\sigma\right)\dim\mathbf{x} - \sum_{\mathbf{y}}\Pr(\mathbf{y})\log Q(\mathbf{y}) \tag{5}$$

where the second term is constant, and the third term may be ignored if y can occupy far fewer states than x (given y) can. The conditions under which the third term can be ignored are satisfied in a neural network, because x is an activity pattern, and y is the vector of locations of a finite number of firing events.

The first term of D is proportional to $D_{VQ}$, whose properties may be investigated using the techniques in [3]. Assume that there are n firing events, so that $\mathbf{y} = (y_1, y_2, \cdots, y_n)$, then the marginal probabilities of the symmetric part S[Pr(y|x)] of Pr(y|x) under interchange of its $(y_1, y_2, \cdots, y_n)$ arguments are given by

$$\begin{aligned}
\Pr(y_1|\mathbf{x}) &\equiv \sum_{y_2,\cdots,y_n=1}^{M} S\left[\Pr(y_1,y_2,\cdots,y_n|\mathbf{x})\right] \\
\Pr(y_1,y_2|\mathbf{x}) &\equiv \sum_{y_3,\cdots,y_n=1}^{M} S\left[\Pr(y_1,y_2,\cdots,y_n|\mathbf{x})\right]
\end{aligned} \tag{6}$$

where $\Pr(y_1|\mathbf{x})$ may be interpreted as the probability that the next firing event occurs on neuron $y_1$ (given x). Also define 2 useful integrals, $D_1$ and $D_2$, as

$$\begin{aligned}
D_1 &\equiv \frac{2}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{M}\Pr(y|\mathbf{x})\left\|\mathbf{x} - \mathbf{x}'(y)\right\|^2 \\
D_2 &\equiv \frac{2(n-1)}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y_1,y_2=1}^{M}\Pr(y_1,y_2|\mathbf{x})\left(\mathbf{x} - \mathbf{x}'(y_1)\right)\cdot\left(\mathbf{x} - \mathbf{x}'(y_2)\right)
\end{aligned} \tag{7}$$

where $\mathbf{x}'(y)$ is any vector function of y (i.e. not necessarily related to $\mathbf{x}'(\mathbf{y})$), to yield the following upper bound on $D_{VQ}$

$$D_{VQ} \le D_1 + D_2 \tag{8}$$

where $D_1$ is non-negative but $D_2$ can have either sign, and the inequality reduces to an equality in the case n = 1. Thus far nothing specific has been assumed about Pr(y|x), other than the fact that it contains no temporal information, so the upper bound on $D_{VQ}$ applies whatever the form of Pr(y|x).

If the firing events occur independently of each other (given x), then $\Pr(y_1,y_2|\mathbf{x}) = \Pr(y_1|\mathbf{x})\Pr(y_2|\mathbf{x})$, which allows $D_2$ to be redefined as

$$D_2 \equiv \frac{2(n-1)}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\left\|\mathbf{x} - \sum_{y=1}^{M}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)\right\|^2 \tag{9}$$

where $D_2$ is non-negative.

In summary, the assumptions which have been made in order to obtain the upper bound on $D_{VQ}$ in equation 8 with the definition of $D_1$ as given in equation 7 and $D_2$ as given in equation 9 are: no temporal information is included in the network state vector (x,y), y can occupy far fewer states than x (given y) can, and firing events occur independently of each other (given x). In reality, there is always temporal information available, and the firing events are correlated with each other, so a more realistic objective function could be constructed. However, it is worthwhile to consider the consequences of equation 8, because it turns out that it leads to many non-trivial results.

The upper bound on $D_{VQ}$ may be minimised with respect to all free parameters in order to obtain a least upper bound. In the case of independent firing events, the free parameters are the x'(y) and the Pr(y|x). These two types of parameters cannot be independently optimised, because they correspond to the generative and recognition models implicit in the neural network, respectively.
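The bound in equation 8 can be illustrated by brute-force enumeration on a toy problem. The sketch below rests on assumptions that are not stated in the text above: x takes a few discrete values, the n firing events are independent given x, and the joint reference vector is replaced by the average of the single-event reference vectors x'(y_i). Under those assumptions the averaged-reconstruction cost equals D_1 + D_2 exactly, and the cost with optimal joint centroids can only be smaller. All variable names and sizes are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
K, M, n, dim = 5, 3, 3, 2                       # input states, neurons, firing events, input dimension
p_x = rng.random(K); p_x /= p_x.sum()           # Pr(x) over K discrete inputs
xs = rng.normal(size=(K, dim))                  # the K input vectors
p_y_given_x = rng.random((K, M))
p_y_given_x /= p_y_given_x.sum(axis=1, keepdims=True)   # Pr(y|x), rows sum to 1
ref = rng.normal(size=(M, dim))                 # single-event reference vectors x'(y)

# D_1 (equation 7) and D_2 (equation 9), with integrals replaced by sums over x.
err = xs[:, None, :] - ref[None, :, :]          # x - x'(y), shape (K, M, dim)
d1 = (2.0 / n) * np.sum(p_x[:, None] * p_y_given_x * np.sum(err ** 2, axis=2))
mean_err = xs - p_y_given_x @ ref               # x - sum_y Pr(y|x) x'(y)
d2 = (2.0 * (n - 1) / n) * np.sum(p_x * np.sum(mean_err ** 2, axis=1))

d_avg = 0.0   # cost when each joint code is reconstructed by averaging x'(y_i)
d_opt = 0.0   # cost when each joint code uses its optimal centroid
for code in itertools.product(range(M), repeat=n):
    idx = list(code)
    joint = p_x * np.prod(p_y_given_x[:, idx], axis=1)   # Pr(x) Pr(y_1..y_n|x)
    avg_ref = ref[idx].mean(axis=0)
    d_avg += 2.0 * np.sum(joint * np.sum((xs - avg_ref) ** 2, axis=1))
    centroid = joint @ xs / joint.sum()
    d_opt += 2.0 * np.sum(joint * np.sum((xs - centroid) ** 2, axis=1))
```

In this setting `d_avg` matches `d1 + d2`, and `d_opt` is no larger, which is the content of the inequality.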

A gradient descent algorithm for optimising the parameter values may readily be obtained by differentiating $D_1$ and $D_2$ with respect to x'(y) and Pr(y|x). Given the freedom to explore the entire space of functions Pr(y|x), the optimum neural firing behaviour (given x) can in principle be determined, and in certain simple cases this can be determined by inspection. If this option is not available, such as would be the case if biological constraints restricted the allowed functional form of Pr(y|x), then a limited search of the entire space of functions Pr(y|x) can be made by invoking a parametric model of the neural firing behaviour (given x).
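A minimal sketch of such a limited parametric search is given below. Everything specific in it is an assumption made for illustration: a 1-dimensional input with uniform density `p0`, one neuron per unit length with reference value x'(y) = y, independent firing events, and a one-parameter trapezoidal family of firing probabilities whose overlap half-width is called `s` here. The sketch evaluates D_1 + D_2 per unit length on a grid and scans `s` for the minimum, rather than using gradient descent.

```python
import numpy as np

def trapezoid_response(u, s):
    """Firing probability as a function of the offset u = y - x (assumed family, 0 < s <= 0.5)."""
    return np.clip((0.5 + s - np.abs(u)) / (2.0 * s), 0.0, 1.0)

def d1_plus_d2(s, n, grid=2000, p0=1.0):
    """D_1 + D_2 per unit length for n independent firing events."""
    x = (np.arange(grid) + 0.5) / grid            # midpoints covering one unit cell [0, 1)
    y = np.arange(-2, 4)                          # the neurons that can respond inside that cell
    p = trapezoid_response(y[None, :] - x[:, None], s)    # Pr(y|x), shape (grid, len(y))
    # The mean over the grid approximates the integral over the unit cell.
    d1 = (2.0 / n) * p0 * np.mean(np.sum(p * (x[:, None] - y[None, :]) ** 2, axis=1))
    d2 = (2.0 * (n - 1) / n) * p0 * np.mean((x - p @ y) ** 2)
    return d1 + d2

n_events = 4
s_values = np.linspace(0.01, 0.5, 50)
costs = np.array([d1_plus_d2(s, n_events) for s in s_values])
best_s = s_values[np.argmin(costs)]
```

For this family the scan settles on an intermediate overlap: a wider overlap raises D_1 but lowers D_2, and the balance shifts towards larger `s` as the number of firing events grows.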


III. EXAMPLES


In this section several examples are presented which illustrate the use of $D_1 + D_2$ in the optimisation of discretely firing neural networks. In section IIIA a topographic mapping network is derived from $D_1$ alone, in section IIIB the Pr(y|x) that minimises $D_1 + D_2$ is shown to be piecewise linear, and a solved example is presented. Finally, in section IIIC a more detailed worked example is presented, which demonstrates how a factorial encoder emerges when $D_1 + D_2$ is minimised.


A. Topographic Mapping Neural Network 


When an appropriate form of $D_{VQ}$ is considered, it can be seen that it leads to a network that is closely related to Kohonen's topographic mapping network [1].

The derivation of a topographic mapping network that was given in [2] will now be recast in the framework of section IIB. Thus, consider the objective function for a 3-layer network (x,y,z), in which (compare equation 1)

$$D = -\sum_{z}\int d\mathbf{x}\,\Pr(\mathbf{x},z)\log Q(\mathbf{x},z) \tag{10}$$

where the cost of encoding y has been ignored, so that effectively only a 2-layer network (x,z) is visible, and $D_{VQ}$ is given by

$$D_{VQ} = 2\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{z=1}^{M_z}\Pr(z|\mathbf{x})\left\|\mathbf{x} - \mathbf{x}'(z)\right\|^2 \tag{11}$$

This expression for $D_{VQ}$ explicitly involves (x,z), but it may be manipulated into a form that explicitly involves (x,y). In order to simplify this calculation, $D_{VQ}$ will be replaced by the equivalent objective function

$$D_{VQ} = \int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{z=1}^{M_z}\Pr(z|\mathbf{x})\int d\mathbf{x}'\,\Pr(\mathbf{x}'|z)\left\|\mathbf{x} - \mathbf{x}'\right\|^2 \tag{12}$$

Now introduce dummy integrations over y to obtain

$$D_{VQ} = \int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{M_y}\Pr(y|\mathbf{x})\sum_{z=1}^{M_z}\Pr(z|y)\sum_{y'=1}^{M_y}\Pr(y'|z)\int d\mathbf{x}'\,\Pr(\mathbf{x}'|y')\left\|\mathbf{x} - \mathbf{x}'\right\|^2 \tag{13}$$

and rearrange to obtain

$$D_{VQ} = \int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y'=1}^{M_y}\Pr(y'|\mathbf{x})\int d\mathbf{x}'\,\Pr(\mathbf{x}'|y')\left\|\mathbf{x} - \mathbf{x}'\right\|^2 \tag{14}$$

where

$$\begin{aligned}
\Pr(y'|y) &= \sum_{z=1}^{M_z}\Pr(y'|z)\,\Pr(z|y) \\
\Pr(y'|\mathbf{x}) &= \sum_{y=1}^{M_y}\Pr(y'|y)\,\Pr(y|\mathbf{x})
\end{aligned} \tag{15}$$

which may be replaced by the equivalent objective function

$$D_{VQ} = 2\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y'=1}^{M_y}\Pr(y'|\mathbf{x})\left\|\mathbf{x} - \mathbf{x}'(y')\right\|^2 \tag{16}$$

By manipulating $D_{VQ}$ from the form it has in equation 11 to the form it has in equation 16, it becomes clear that optimisation of the (x,z) network involves optimisation of the (x,y') subnetwork, for which an objective function can be written that uses a Pr(y'|x) as defined in equation 15. When optimising the (x,y') subnetwork, Pr(y'|y) takes account of the effect that z has on y.
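To make the induced coupling Pr(y'|y) of equation 15 concrete, the following sketch computes it for discrete y and z. It assumes that Pr(y'|z) is obtained from Pr(z|y) by Bayes' theorem with a prior Pr(y), which is supplied by the caller; the function name `induced_neighbourhood` and the toy numbers are hypothetical.

```python
import numpy as np

def induced_neighbourhood(p_z_given_y, p_y):
    """Return the matrix N[y, y'] = Pr(y'|y) = sum_z Pr(y'|z) Pr(z|y).

    p_z_given_y: shape (M_y, M_z), rows sum to 1.
    p_y: shape (M_y,), the assumed prior over y.
    """
    p_z_given_y = np.asarray(p_z_given_y, dtype=float)
    p_y = np.asarray(p_y, dtype=float)
    joint = p_y[:, None] * p_z_given_y          # Pr(y, z)
    p_z = joint.sum(axis=0)                     # Pr(z)
    p_y_given_z = joint / p_z                   # column z holds Pr(y|z) (Bayes' theorem)
    return p_z_given_y @ p_y_given_z.T          # sum over z

# Four y states feeding two z states deterministically: y0, y1 -> z0 and y2, y3 -> z1.
p_z_given_y = np.array([[1.0, 0.0],
                        [1.0, 0.0],
                        [0.0, 1.0],
                        [0.0, 1.0]])
neighbourhood = induced_neighbourhood(p_z_given_y, np.full(4, 0.25))
```

In this toy case the y states that map to the same z become equal-weight neighbours of each other, and states that map to different z are not coupled at all.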

If n = 1, so that only 1 firing event is observed, then $D_{VQ} = D_1$, and the optimum Pr(y|x) must ensure that y depends deterministically on x, so that $\Pr(y|\mathbf{x}) = \delta_{y,y(\mathbf{x})}$ where y(x) is an encoding function that converts x into the index of the neuron that fires in response to x. This allows $D_{VQ}$ to be simplified to

$$D_{VQ} = 2\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y'=1}^{M_y}\Pr(y'|y(\mathbf{x}))\left\|\mathbf{x} - \mathbf{x}'(y')\right\|^2 \tag{17}$$

where Pr(y'|y(x)) is Pr(y'|y) with y replaced by y(x). Note that if $\Pr(y'|y) = \delta_{y,y'}$ then $D_{VQ}$ reduces to the objective function $2\int d\mathbf{x}\,\Pr(\mathbf{x})\left\|\mathbf{x} - \mathbf{x}'(y(\mathbf{x}))\right\|^2$ for a standard vector quantiser (VQ).


The optimum y(x) is given by $y(\mathbf{x}) = \arg\min_{y}\sum_{y'=1}^{M_y}\Pr(y'|y)\left\|\mathbf{x} - \mathbf{x}'(y')\right\|^2$ (which is not quite the same as the $y(\mathbf{x}) = \arg\min_{y}\left\|\mathbf{x} - \mathbf{x}'(y)\right\|^2$ used by Kohonen in his topographic mapping neural network [1]), and a gradient descent algorithm for updating x'(y') is $\mathbf{x}'(y') \to \mathbf{x}'(y') + \varepsilon\,\Pr(y'|y(\mathbf{x}))\left(\mathbf{x} - \mathbf{x}'(y')\right)$ (which is identical to Kohonen's prescription [1]). The Pr(y'|y) may thus be interpreted as the neighbourhood function, and the x'(y') may be interpreted as the weight vectors, of a topographic mapping. Because all states y that can give rise to the same state z (as specified by Pr(z|y)) become neighbours (as specified by Pr(y'|y) in equation 15), Pr(y'|y) includes a much larger class of neighbourhood functions than has hitherto been used in topographic mapping neural networks.
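The encoding rule and the weight update described above can be sketched as a single training step. This is an illustrative sketch rather than a definitive implementation: the neighbourhood matrix is assumed to be stored as `neighbourhood[y, y']` = Pr(y'|y), the step size `rate` is a hypothetical parameter, and the function names are invented for the example.

```python
import numpy as np

def encode(x, ref, neighbourhood):
    """Pick the y that minimises sum_y' Pr(y'|y) ||x - x'(y')||^2."""
    sq_dist = np.sum((x - ref) ** 2, axis=1)            # ||x - x'(y')||^2 for every y'
    return int(np.argmin(neighbourhood @ sq_dist))      # neighbourhood-weighted distortion per y

def update(x, ref, neighbourhood, rate=0.1):
    """Move every reference vector towards x, weighted by Pr(y'|y(x))."""
    y = encode(x, ref, neighbourhood)
    return ref + rate * neighbourhood[y][:, None] * (x - ref)

# Three neurons on a line with a nearest-neighbour coupling (rows sum to 1).
ref = np.array([[0.0], [1.0], [2.0]])
neighbourhood = np.array([[0.8, 0.2, 0.0],
                          [0.1, 0.8, 0.1],
                          [0.0, 0.2, 0.8]])
x = np.array([0.9])
winner = encode(x, ref, neighbourhood)
new_ref = update(x, ref, neighbourhood)
```

Setting the neighbourhood matrix to the identity recovers the standard vector quantiser, in which only the winning reference vector moves.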


Because of the principled way in which the topographic 
mapping objective function has been derived here, it is 
the preferred way to optimise topographic mapping net- 
works. It also allows the objective function to be gen- 
eralised to the case n > 1, where more than one firing 
event is observed. 
B. Piecewise Linear Probability of Firing


The optimal Pr(y|x) has some interesting properties that can be obtained by inspecting its stationarity condition. For instance, the Pr(y|x) that minimise $D_1 + D_2$ will be shown to be piecewise linear functions of x.

Thus, functionally differentiate $D_1 + D_2$ with respect to log Pr(y|x), where logarithmic differentiation implicitly imposes the constraint $\Pr(y|\mathbf{x}) \ge 0$, and use a Lagrange multiplier term $L \equiv \int d\mathbf{x}'\,\lambda(\mathbf{x}')\sum_{y'=1}^{M}\Pr(y'|\mathbf{x}')$ to impose the normalisation constraint $\sum_{y=1}^{M}\Pr(y|\mathbf{x}) = 1$ for each x, to obtain

$$\begin{aligned}
\frac{\delta\left(D_1 + D_2 - L\right)}{\delta\log\Pr(y|\mathbf{x})} ={}& \frac{2}{n}\Pr(\mathbf{x})\,\Pr(y|\mathbf{x})\left\|\mathbf{x} - \mathbf{x}'(y)\right\|^2 \\
&- \frac{4(n-1)}{n}\Pr(\mathbf{x})\,\Pr(y|\mathbf{x})\,\mathbf{x}'(y)\cdot\left(\mathbf{x} - \sum_{y'=1}^{M}\Pr(y'|\mathbf{x})\,\mathbf{x}'(y')\right) \\
&- \lambda(\mathbf{x})\,\Pr(y|\mathbf{x})
\end{aligned} \tag{18}$$

The stationarity condition implies that $\sum_{y=1}^{M}\Pr(y|\mathbf{x})\frac{\delta\left(D_1 + D_2 - L\right)}{\delta\Pr(y|\mathbf{x})} = 0$, which may be used to determine the Lagrange multiplier function λ(x). When λ(x) is substituted back into the stationarity condition itself, it yields

$$0 = \Pr(\mathbf{x})\,\Pr(y|\mathbf{x})\sum_{y'=1}^{M}\left(\Pr(y'|\mathbf{x}) - \delta_{y,y'}\right)\mathbf{x}'(y')\cdot\left(\frac{\mathbf{x}'(y')}{2} - n\,\mathbf{x} + (n-1)\sum_{y''=1}^{M}\Pr(y''|\mathbf{x})\,\mathbf{x}'(y'')\right) \tag{19}$$

There are several classes of solution to this stationarity condition, corresponding to one (or more) of the three factors in equation 19 being zero.


1. Pr(x) = 0 (the first factor is zero). If the input PDF is zero at x, then nothing can be deduced about Pr(y|x), because there is no training data to explore the network's behaviour at this point.


2. Pr(y|x) = 0 (the second factor is zero). This factor arises from the differentiation with respect to log Pr(y|x), and it ensures that Pr(y|x) < 0 cannot be attained. The singularity in log Pr(y|x) when Pr(y|x) = 0 is what causes this solution to emerge.


3. $\sum_{y'=1}^{M}\left(\Pr(y'|\mathbf{x}) - \delta_{y,y'}\right)\mathbf{x}'(y')\cdot(\cdots) = 0$ (the third factor is zero). The solution to this equation is a Pr(y|x) that has a piecewise linear dependence on x. This result can be seen to be intuitively reasonable because $D_1 + D_2$ is of the form $\int d\mathbf{x}\,\Pr(\mathbf{x})\,f(\mathbf{x})$, where f(x) is a linear combination of terms of the form $\mathbf{x}^i\,\Pr(y|\mathbf{x})^j$ (for i = 0,1,2 and j = 0,1,2), which is a quadratic form in x (ignoring the x-dependence of Pr(y|x)). However, the terms that appear in this linear combination are such that a Pr(y|x) that is a piecewise linear function of x guarantees that f(x) is a piecewise linear combination of terms of the form $\mathbf{x}^i$ (for i = 0,1,2), which is a quadratic form in x (the normalisation constraint $\sum_{y=1}^{M}\Pr(y|\mathbf{x}) = 1$ is used to remove a contribution that is potentially quartic in x). Thus a piecewise linear dependence of Pr(y|x) on x does not lead to any dependencies on x that are not already explicitly present in $D_1 + D_2$. The stationarity condition on Pr(y|x) (see equation 19) then imposes conditions on the allowed piecewise linearities that Pr(y|x) can have.


For the purpose of doing analytic calculations, it is much 
easier to obtain analytic results with the ideal piecewise 
linear Pr(y|x) than with some other functional form. If 
the optimisation of Pr(y|x) is constrained, by introducing 
a parametric form which has some biological plausibility, 
for instance, then analytic optimum solutions are not in 
general possible to calculate, and it becomes necessary to 
resort to numerical simulations. Piecewise linear Pr(y|x) 
should therefore be regarded as a convenient theoreti- 
cal laboratory for investigating the properties of idealised 
neural networks. 


1. Solved Example 


A simple example illustrates how the piecewise lin- 
earity property of Pr(y|x) may be used to find optimal 
solutions. Thus consider a 1-dimensional input coordi- 
nate x € [—oo,+o0], with Pr(w) = Pp. Assume that 
the number of neurons M tends to infinity in such a 
way that there is 1 neuron per unit length of x, so that 
Pr(y|x) = p(y — x), where the piecewise linear property 
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gives p(x) as 


1 Iz] << 5-8 
p(n) = 4 Bele gg -sSielsgts 
0 Iz] >54+s 


(20) 


and by symmetry $x'(y) = y$. This Pr(y|x) and $x'(y)$ allow $D_1$ to be derived as

$$D_1 \text{ (per neuron)} = \frac{2 P_0}{n} \left( \int_{-\frac{1}{2}+s}^{\frac{1}{2}-s} dx\, x^2 + 2 \int_{\frac{1}{2}-s}^{\frac{1}{2}+s} dx\, \frac{1 + 2s - 2x}{4s}\, x^2 \right) = \frac{P_0}{6n} \left( 1 + 4s^2 \right) \tag{21}$$

and $D_2$ to be derived as

$$D_2 \text{ (per unit length)} = \frac{2 (n-1) P_0}{n} \left( \int_{-\frac{1}{2}+s}^{\frac{1}{2}-s} dx\, x^2 + \int_{\frac{1}{2}-s}^{\frac{1}{2}+s} dx \left( x - \frac{2x - 1 + 2s}{4s} \right)^2 \right) = \frac{(n-1) P_0}{6n} \left( 1 - 2s \right)^2 \tag{22}$$

Because there is one neuron per unit length, the contribution per unit length
to $D_1 + D_2$ is the sum of the above two results

$$D_1 + D_2 \text{ (per unit length)} = \frac{P_0}{6n} \left( n - 4 (n-1) s + 4 n s^2 \right) \tag{23}$$

If $D_1 + D_2$ is differentiated with respect to s, then the stationarity
condition $\frac{d (D_1 + D_2)}{ds} = 0$ yields the optimum value of s as

$$s = \frac{n-1}{2n} \tag{24}$$

and the stationary value of $D_1 + D_2$ as

$$D_1 + D_2 \text{ (per unit length)} = \frac{(2n-1) P_0}{6 n^2} \tag{25}$$

When n = 1 the stationary solution reduces to s = 0 and $D_1 + D_2$ (per unit
length) $= \frac{P_0}{6}$, which is a standard vector quantiser with
nonoverlapping neural response regions which partition the input space into
unit width quantisation cells, so that for all x there is exactly one neuron
that responds. Although the neurons have been manually arranged in topographic
order by imposing Pr(y|x) = p(y - x), any permutation of the neuron indices in
this stationary solution will also be a stationary solution. This derivation
could be generalised to the type of 3-layer network that was considered in
section IIA, in which case a neighbourhood function Pr(y'|y) would emerge
automatically.

As $n \to \infty$ the stationary solution behaves as $s \to \frac{1}{2}$ and
$D_1 + D_2$ (per unit length) $\to \frac{P_0}{3n}$, with overlapping linear
neural response regions which cover the input space, so that for all x there
are exactly two neurons that respond with equal and opposite linear dependence
on x. As $n \to \infty$ the ratio of the number of firing events that occur on
these two neurons is sufficient to determine x to $O\left(\frac{1}{n}\right)$.
When $n = \infty$ this stationary solution is $s = \frac{1}{2}$ and
$D_1 + D_2$ (per unit length) = 0. However, when $n = \infty$ there are
infinitely many other ways in which the neurons could be used to yield
$D_1 + D_2$ (per unit length) = 0, because only the $D_2$ term contributes, and
it is 0 when $x = \sum_{y=1}^{M} \Pr(y|x)\, x'(y)$. This is possible for any set
of basis elements $x'(y)$ that span the input space, provided that the
expansion coefficients Pr(y|x) satisfy $\Pr(y|x) \ge 0$. In this 1-dimensional
example only two basis elements are required (i.e. M = 2), which are
$x'(1) = -\infty$ and $x'(2) = +\infty$. More generally, for this type of
stationary solution, $M = \dim x + 1$ is required to span the input space in
such a way that $\Pr(y|x) \ge 0$, and if $M < \dim x + 1$ then the stationary
solution will span the input subspace (of dimension M - 1) that has the
largest variance.
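The closed forms in equations 23 to 25 can be checked numerically. The following is a minimal sketch, assuming neurons sit at integer positions, $P_0 = 1$, and the standard definitions $D_1 = \frac{2}{n} \int dx \Pr(x) \sum_y \Pr(y|x) \|x - x'(y)\|^2$ and $D_2 = \frac{2(n-1)}{n} \int dx \Pr(x) \|x - \sum_y \Pr(y|x) x'(y)\|^2$; the function names and the midpoint-rule integration are illustrative choices, not part of the original derivation.

```python
import math


def p(x, s):
    """Trapezoidal firing probability p(x) of equation 20, for 0 <= s <= 1/2."""
    a = abs(x)
    if a <= 0.5 - s:
        return 1.0
    if s > 0.0 and a <= 0.5 + s:
        return (0.5 + s - a) / (2.0 * s)
    return 0.0


def optimal_s(n):
    """Stationary overlap parameter s = (n - 1) / (2n) of equation 24."""
    return (n - 1) / (2.0 * n)


def closed_form(n, s, p0=1.0):
    """D1 + D2 per unit length as given in equation 23."""
    return p0 / (6.0 * n) * (n - 4.0 * (n - 1) * s + 4.0 * n * s * s)


def numerical(n, s, p0=1.0, steps=4000):
    """Midpoint-rule estimate of D1 + D2 over the unit cell centred on neuron y = 0."""
    h = 1.0 / steps
    d1 = 0.0
    d2 = 0.0
    neurons = (-1, 0, 1)  # the only neurons that can respond when s <= 1/2
    for k in range(steps):
        x = -0.5 + (k + 0.5) * h
        probs = [p(y - x, s) for y in neurons]
        # Incoherent part: each firing neuron reconstructs x as its own x'(y) = y.
        d1 += h * sum(q * (x - y) ** 2 for q, y in zip(probs, neurons))
        # Coherent part: the reconstruction is the probability-weighted mean of x'(y).
        d2 += h * (x - sum(q * y for q, y in zip(probs, neurons))) ** 2
    return p0 * (2.0 / n * d1 + 2.0 * (n - 1) / n * d2)


if __name__ == "__main__":
    for n in (1, 2, 4, 16):
        s = optimal_s(n)
        print(n, s, closed_form(n, s), numerical(n, s))
```

For each n the two printed values should agree closely, and moving s away from `optimal_s(n)` should increase `closed_form(n, s)`.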


The n = 1 and $n \to \infty$ limiting cases are very different. When n = 1 the
optimum network splits up the input space into non-overlapping quantisation
cells, and as $n \to \infty$ the optimum network does a linear decomposition of
the input space using non-negative expansion coefficients. This behaviour
occurs because for n > 1 the neurons can cooperate when encoding the input x,
so that by allowing more than one neuron to fire in response to x, the encoded
version of x is distributed over more than one neuron. In the above
1-dimensional example, the code is spread over one or two neurons depending on
the value of x. This cooperation amongst neurons is a property of the coherent
part $D_2$ of the upper bound on $D_{VQ}$ (see equation 8).


C. Factorial Encoder Network


For certain types of distribution of data in input space 
the optimal network consists of a number of subnetworks, 
each of which responds to only a subspace of the input 
space. This is called factorial encoding, where the en- 
coded input is distributed over more than one neuron, 
and this distributed code typically has a much richer 
structure than was encountered in section IIIB. 

The simplest problem that demonstrates factorial encoding will now be
investigated (this example was presented in [4], but the derivation given here
is more direct). Thus, assume that the data in input space uniformly populates
the surface of a 2-torus $S^1 \times S^1$. Each of the $S^1$ is a plane unit
circle embedded in $R^1 \times R^1$ and centred on the origin, and
$S^1 \times S^1$ is the Cartesian product of a pair of such circles. Overall,
the 2-torus lives in a 4-dimensional input space whose elements are denoted as
$x = (x_1, x_2, x_3, x_4)$, where one of the circles lives in $(x_1, x_2)$ and
the other lives in $(x_3, x_4)$. These circles may be parameterised by angular
degrees of freedom $\theta_{12}$ and $\theta_{34}$, respectively.

The optimal Pr(y|x) (i.e. a piecewise linear stationary solution of the type
that was encountered in section IIIB) could be derived from this input data
PDF Pr(x). However, the properties of the sought-after optimal Pr(y|x) are
preserved if one restricts the solution space to the following types of
Pr(y|x)

$$\Pr(y|x) = \begin{cases} \delta_{y, y(\theta_{12})} \text{ or } \delta_{y, y(\theta_{34})} & \text{type 1} \\ \delta_{y, y(\theta_{12}, \theta_{34})} & \text{type 2} \\ \frac{1}{2} \left( \delta_{y, y_{12}(\theta_{12})} + \delta_{y, y_{34}(\theta_{34})} \right) & \text{type 3} \end{cases} \tag{26}$$

where $y(\theta_{12})$ and $y_{12}(\theta_{12})$ encode $\theta_{12}$,
$y(\theta_{34})$ and $y_{34}(\theta_{34})$ encode $\theta_{34}$, and
$y(\theta_{12}, \theta_{34})$ encodes $(\theta_{12}, \theta_{34})$. The allowed
ranges of the code indices are $1 \le y(\theta_{12}) \le M$ (and similarly
$y(\theta_{34})$), $1 \le y_{12}(\theta_{12}) \le \frac{M}{2}$,
$\frac{M}{2} + 1 \le y_{34}(\theta_{34}) \le M$, and
$1 \le y(\theta_{12}, \theta_{34}) \le M$. The type 1 solution assumes that all
M neurons respond only to $\theta_{12}$ (or, alternatively, all respond only to
$\theta_{34}$), the type 2 solution assumes that all M neurons respond to
$(\theta_{12}, \theta_{34})$, and the type 3 solution (which is a very simple
type of factorial encoder) assumes that $\frac{M}{2}$ neurons respond only to
$\theta_{12}$, and the other $\frac{M}{2}$ neurons respond only to
$\theta_{34}$.


In order to derive explicit results for the stationary value of $D_1 + D_2$,
it is necessary to optimise the $x'(y)$. The stationarity condition on $x'(y)$
may readily be deduced from $\frac{\partial (D_1 + D_2)}{\partial x'(y)} = 0$ as

$$n \int dx\, \Pr(x|y)\, x = x'(y) + (n-1) \int dx\, \Pr(x|y) \sum_{y'=1}^{M} \Pr(y'|x)\, x'(y') \tag{27}$$

If Pr(y|x) (and hence Pr(x|y)) are inserted into this stationarity condition,
then it may be solved for the corresponding $x'(y)$.

Assuming that the encoding functions partition the 2-torus symmetrically, the
three types of solution may be optimised as described in the following three
sections.


1. Type 1 Solution 


Assume that $\Pr(y|x) = \delta_{y, y(x_1, x_2)}$, and that the y = 1
quantisation cell is the Cartesian product of the arcs
$|\theta_{12}| \le \frac{\pi}{M}$ and $0 \le \theta_{34} \le 2\pi$ of the 2 unit
circles that form the 2-torus, then the stationarity condition for $x'(1)$
becomes

$$n \frac{M}{2\pi} \int_{-\frac{\pi}{M}}^{\frac{\pi}{M}} d\theta_{12}\, \frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34}\, (\cos\theta_{12}, \sin\theta_{12}, \cos\theta_{34}, \sin\theta_{34}) = x'(1) + (n-1) \frac{M}{2\pi} \int_{-\frac{\pi}{M}}^{\frac{\pi}{M}} d\theta_{12}\, \frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34}\, x'(1) \tag{28}$$

which yields the solution
$x'(1) = \left( \frac{M}{\pi} \sin\left(\frac{\pi}{M}\right), 0, 0, 0 \right)$.
The first two components are the centroid of the arc
$|\theta_{12}| \le \frac{\pi}{M}$ of a unit circle centred on the origin. All of
the $x'(y)$ can be obtained by rotating $x'(1)$ about the origin by multiples
of $\frac{2\pi}{M}$. Using the assumed symmetry of the solution, the expression
for $D_1 + D_2$ becomes

$$D_1 + D_2 = \frac{M}{\pi} \int_{-\frac{\pi}{M}}^{\frac{\pi}{M}} d\theta_{12} \left\| (\cos\theta_{12}, \sin\theta_{12}) - \left( \frac{M}{\pi} \sin\left(\frac{\pi}{M}\right), 0 \right) \right\|^2 + \frac{1}{\pi} \int_{0}^{2\pi} d\theta_{34} \left\| (\cos\theta_{34}, \sin\theta_{34}) - (0, 0) \right\|^2 \tag{29}$$

where the first (or second) term corresponds to the subspace to which the
neurons respond (or do not respond). This gives the stationary value of
$D_1 + D_2$ as
$D_1 + D_2 = 4 - \frac{2M^2}{\pi^2} \sin^2\left(\frac{\pi}{M}\right)$. Because
$\Pr(y|x) = \delta_{y, y(\theta_{12})}$ or $\delta_{y, y(\theta_{34})}$, only one
neuron can fire (given x), so no further information about x can be obtained
after the first firing event has occurred, and this result for $D_1 + D_2$ is
therefore independent of n, as expected.
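The centroid and the stationary value quoted for the type 1 solution can be verified by direct numerical averaging over one quantisation cell. This is a small sketch under the assumption that the y = 1 cell is the arc $|\theta_{12}| \le \pi/M$; the helper names `arc_centroid` and `type1_distortion` are hypothetical and introduced only for this check.

```python
import math


def arc_centroid(M, steps=10000):
    """Average of (cos t, sin t) over the arc |t| <= pi / M, by the midpoint rule."""
    half = math.pi / M
    h = 2.0 * half / steps
    cx = sum(math.cos(-half + (k + 0.5) * h) for k in range(steps)) * h / (2.0 * half)
    cy = sum(math.sin(-half + (k + 0.5) * h) for k in range(steps)) * h / (2.0 * half)
    return cx, cy


def type1_distortion(M):
    """Closed form 4 - (2 M^2 / pi^2) sin^2(pi / M) for the type 1 solution."""
    return 4.0 - 2.0 * (M / math.pi) ** 2 * math.sin(math.pi / M) ** 2


if __name__ == "__main__":
    M = 8
    cx, cy = arc_centroid(M)
    # The encoded circle contributes 2 * (1 - |centroid|^2); the ignored circle contributes 2.
    print(cx, M / math.pi * math.sin(math.pi / M))
    print(2.0 * (1.0 - cx * cx) + 2.0, type1_distortion(M))
```

The second printed pair compares the distortion assembled from the numerical centroid with the closed form; the constant 2 is the contribution of the circle that the neurons do not encode.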


2. Type 2 Solution 


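Assume that the y = 1 quantisation cell is the Cartesian product of the arcs
$|\theta_{12}| \le \frac{\pi}{\sqrt{M}}$ and
$|\theta_{34}| \le \frac{\pi}{\sqrt{M}}$ of the two unit circles that form the
2-torus. The stationarity condition for $x'(1)$ can be deduced from the type 1
case with the replacement $M \to \sqrt{M}$, which gives
$x'(1) = \left( \frac{\sqrt{M}}{\pi} \sin\left(\frac{\pi}{\sqrt{M}}\right), 0, \frac{\sqrt{M}}{\pi} \sin\left(\frac{\pi}{\sqrt{M}}\right), 0 \right)$.
The expression for $D_1 + D_2$ may similarly be deduced from the type 1 case as
twice the first term in equation 29 with the replacement $M \to \sqrt{M}$, to
yield the stationary value of $D_1 + D_2$ as
$D_1 + D_2 = 4 - \frac{4M}{\pi^2} \sin^2\left(\frac{\pi}{\sqrt{M}}\right)$. As in
the type 1 case, this result for $D_1 + D_2$ is independent of n.


3. Type 3 Solution


The stationarity condition for $x'(1)$ can be written by analogy with the type
1 case, with the replacement $M \to \frac{M}{2}$, and modifying the last term to
take account of the more complicated form of Pr(y|x), to yield

$$n \frac{M}{4\pi} \int_{-\frac{2\pi}{M}}^{\frac{2\pi}{M}} d\theta_{12}\, \frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34}\, (\cos\theta_{12}, \sin\theta_{12}, \cos\theta_{34}, \sin\theta_{34}) = x'(1) + (n-1) \frac{M}{4\pi} \int_{-\frac{2\pi}{M}}^{\frac{2\pi}{M}} d\theta_{12}\, \frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34}\, \frac{1}{2} x'(1) \tag{30}$$

where
$\frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34} \sum_{y'=\frac{M}{2}+1}^{M} \delta_{y', y_{34}(\theta_{34})}\, x'(y') = 0$
has been used (this follows from the assumed symmetry of the solution). This
yields the solution
$x'(1) = \frac{2n}{n+1} \left( \frac{M}{2\pi} \sin\left(\frac{2\pi}{M}\right), 0, 0, 0 \right)$.
Using the assumed symmetry of the solution, the expression for $D_1$ becomes

$$D_1 = \frac{2}{n} \left( \frac{M}{4\pi} \int_{-\frac{2\pi}{M}}^{\frac{2\pi}{M}} d\theta_{12} \left\| (\cos\theta_{12}, \sin\theta_{12}) - \left( \frac{nM}{(n+1)\pi} \sin\left(\frac{2\pi}{M}\right), 0 \right) \right\|^2 + \frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34} \left\| (\cos\theta_{34}, \sin\theta_{34}) \right\|^2 \right) \tag{31}$$

and the expression for $D_2$ becomes

$$D_2 = \frac{2(n-1)}{n} \frac{M}{2\pi} \int_{-\frac{2\pi}{M}}^{\frac{2\pi}{M}} d\theta_{12} \left\| (\cos\theta_{12}, \sin\theta_{12}) - \frac{1}{2} \left( \frac{nM}{(n+1)\pi} \sin\left(\frac{2\pi}{M}\right), 0 \right) \right\|^2 \tag{32}$$

This gives the stationary value of $D_1 + D_2$ as
$D_1 + D_2 = 4 - \frac{nM^2}{(n+1)\pi^2} \sin^2\left(\frac{2\pi}{M}\right)$.
Because
$\Pr(y|x) = \frac{1}{2} \left( \delta_{y, y_{12}(\theta_{12})} + \delta_{y, y_{34}(\theta_{34})} \right)$,
one firing event has to occur in each of the intervals
$1 \le y \le \frac{M}{2}$ and $\frac{M}{2} + 1 \le y \le M$ for all of the
information to be collected about x. However, the random nature of the firing
events means that the probability with which this condition is satisfied
increases with n, so this result for $D_1 + D_2$ decreases as n increases.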


4. Relative Stability of Solutions 


Collect the above results together for comparison. 


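$$D_1 + D_2 = \begin{cases} 4 - \frac{2M^2}{\pi^2} \sin^2\left(\frac{\pi}{M}\right) & \text{type 1} \\ 4 - \frac{4M}{\pi^2} \sin^2\left(\frac{\pi}{\sqrt{M}}\right) & \text{type 2} \\ 4 - \frac{nM^2}{(n+1)\pi^2} \sin^2\left(\frac{2\pi}{M}\right) & \text{type 3} \end{cases} \tag{33}$$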


For constant M and letting $n \to \infty$, the value of $D_1 + D_2$ for the
type 3 solution asymptotically behaves as
$D_1 + D_2 \text{ (type 3)} \to 4 - \frac{M^2}{\pi^2} \sin^2\left(\frac{2\pi}{M}\right)$,
in which case the relative stability of the three types of solution is: type 3
(most stable), type 2 (intermediate), type 1 (least stable). Similarly, for
constant n and letting $M \to \infty$, the relative stability of the three
types of solution is: type 2 (most stable), type 3 (intermediate), type 1
(least stable).
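The two limiting orderings can be reproduced by evaluating the three stationary values side by side. The sketch below assumes the closed forms collected in equation 33 (with the type 3 value carrying the factor n / (n + 1)); the function names and the particular values of M and n are illustrative only.

```python
import math


def d_type1(M):
    """All M neurons encode one circle only."""
    return 4.0 - 2.0 * (M / math.pi) ** 2 * math.sin(math.pi / M) ** 2


def d_type2(M):
    """All M neurons encode both circles jointly (sqrt(M) cells per circle)."""
    return 4.0 - 4.0 * M / math.pi ** 2 * math.sin(math.pi / math.sqrt(M)) ** 2


def d_type3(M, n):
    """Factorial encoder: M / 2 neurons per circle, n firing events."""
    return 4.0 - n / (n + 1.0) * (M / math.pi) ** 2 * math.sin(2.0 * math.pi / M) ** 2


def ranking(M, n):
    """Solution types ordered from most stable (smallest D1 + D2) to least stable."""
    values = {"type 1": d_type1(M), "type 2": d_type2(M), "type 3": d_type3(M, n)}
    return sorted(values, key=values.get)


if __name__ == "__main__":
    print(ranking(M=16, n=1000))   # few neurons, many firing events
    print(ranking(M=10000, n=2))   # many neurons, few firing events
```

With a small M and a large n the factorial (type 3) solution comes first, whereas with a very large M and a small n the joint (type 2) solution comes first, in line with the two limiting cases described above.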

In both of these limiting cases the type 1 solution is least stable. If there
is a fixed number of firing events n, and there is no upper limit on the
number of neurons M, then the type 2 solution is most stable, because it can
partition the 2-torus into lots of small quantisation cells. If there is a
fixed number of neurons M (which is the usual case), and there is no upper
limit on the number of firing events n, then the type 3 solution is most
stable, because the limited size of M renders the type 2 solution inefficient
(the quantisation cells would be too large), so the 2-torus $S^1 \times S^1$ is
split into two $S^1$ subspaces each of which is assigned a subset of
$\frac{M}{2}$ neurons. If n is large enough, then each of these two subsets of
neurons has a high probability of occurrence of a firing event, which ensures
that both of the $S^1$ subspaces are encoded.

More generally, when there is a limited number of neurons they will tend to
split into subsets, each of which encodes a separate subspace of the input.
The assumed form of Pr(y|x) in equation 26 does not allow an unrestricted
search of all possible Pr(y|x). If the global optimum solution (which has
piecewise linear Pr(y|x), as proved in section IIIB) cuts up the input space
into partially overlapping pieces, then it is well approximated by a solution
such as one of those listed in equation 26. Typically, curved input spaces
lead to such solutions, because a piecewise linear Pr(y|x) can readily
quantise such spaces by slicing off the curved “corners” that occur in such
spaces.


IV. CONCLUSIONS 


In this paper an objective function for optimising a 
layered network of discretely firing neurons has been pre- 
sented, and three non-trivial examples of how it is applied 
have been shown: topographic mapping networks, piece- 
wise linear dependence on the input of the probability 
of a neuron firing, and factorial encoder networks. Many 
other examples could be given, such as combining the first 
and third of the above results to obtain factorial topo- 
graphic networks, or extending the theory to multilayer 
networks, or introducing temporal information. 
Modelling the Probability Density of Markov Sources *


Stephen Luttrell 
Room EX21, QinetiQ, Malvern Technology Centre 


This paper introduces an objective function that seeks to minimise the average total number
of bits required to encode the joint state of all of the layers of a Markov source. This type of
encoder may be applied to the problem of optimising the bottom-up (recognition model) and top-
down (generative model) connections in a multilayer neural network, and it unifies several previous
results on the optimisation of multilayer neural networks.


I. INTRODUCTION 


There is currently a great deal of interest in modelling 
probability density functions (PDF). This research is mo- 
tivated by the fact that the joint PDF of a set of vari- 
ables can be used to deduce any conditional PDF which 
involves these variables alone, which thus allows all in- 
ference problems in the space of these variables to be 
addressed quantitatively. The only limitation of this ap- 
proach to solving inference problems is that a model of 
the PDF is used, rather than the actual PDF itself, which 
can lead to inaccurate inferences. The objective function 
for optimising a PDF model is usually to maximise the 
log-likelihood that it could generate the training set: i.e.
maximise $\langle \log (\text{model probability}) \rangle_{\text{training set}}$.

In this paper the problem of modelling the PDF of a 
Markov source will be studied. In the language of neural 
networks, this type of source can be viewed as a layered 
network, in which the state of each layer directly influ- 
ences the states of only the layers immediately above and 
below it. The optimal PDF model then approximates the 
joint PDF of the states of all of the layers of the network, 
or at least some subset of the layers of the network, which 
is a generalisation of what is conventionally done in neu- 
ral network PDF models. 

Markov source density modelling is interesting because 
it unifies a number of existing neural network techniques 
into a single framework, and may also be viewed as a 
density modelling perspective on the results reported in 
[10]. For instance, an approximation to the standard Ko- 
honen topographic mapping neural network [3] emerges 
from density modelling the joint PDF of the input and 
output layers of a 3-layer network, and generalisations 
of the Kohonen network also emerge naturally from this 
framework. 

In section II the relevant parts of the Shannon theory of information are
summarised [12, 13], and the application to coding various types of source is
derived [11]; in particular, Markov sources are discussed, because they are
the key to the approach that is presented in this paper. In section III the
application of Markov source coding to unsupervised neural networks is
discussed in detail, including the Kohonen network [3]. In section IV (and
appendix A) hierarchical encoding using an adaptive cluster expansion (ACE) is
discussed [5], and in section V (and appendix B) factorial encoding using a
partitioned mixture distribution (PMD) is discussed [9]. Finally in appendix
III density modelling of Markov sources is compared with standard density
modelling using a Helmholtz machine [2].


II. CODING THEORY 


In section IIA the basic ideas of information theory are outlined (this
discussion is inspired by the reasoning presented in [12, 13]), and in section
IIB the process of using a model to code a source is described in detail. In
section IIC this is extended to the case of a Markov source. In section IID
the relationship between conventional density models and Markov density models
is discussed.

See [12, 13] for a lucid introduction to information the- 
ory, and see [11] for a discussion of the number of bits 
required to encode a source using a model. 


A. Information Theory 


A source of symbols (drawn from an alphabet of M distinct symbols) is modelled
as a vector of probabilities denoted as P

$$P = (P_1, P_2, \cdots, P_M) \tag{2.1}$$

which describes the relative frequency with which each symbol is drawn
independently from the source P. A trivial example is an unbiased die, which
has M = 6 and $P_i = \frac{1}{6}$ for $i = 1, 2, \cdots, 6$.

The ordered sequence of symbols drawn independently 
from a source may be partitioned into subsequences of 
N symbols, and each such subsequence will be called a 
message. If N is very large, then a message is called 
“likely” if the relative frequency of occurrence of its sym- 
bols approximates P, otherwise it is called “unlikely”. As 
$N \to \infty$ the set of likely messages is very sharply defined,
in the sense that the proportion of all messages that lie 


in the transition region between being likely and being 
unlikely becomes vanishingly small. Thus there is a set 
of likely messages all with equal probability of occurring 
(because each likely message has the same relative fre- 
quency of occurrence of each of the M possible symbols), 
and a set of unlikely messages (i.e. all the messages that 
are not likely messages) that have essentially zero proba- 
bility of occurring. It is this separation of messages into 
a likely set (all with equal probability) and an unlikely 
set (all with zero probability) that underlies information 
theory, as discussed in [12, 13]. 

A likely message from P will be called a likely P-message. As $N \to \infty$
the number of times $n_i$ that each symbol i occurs in a P-message of length N
is $n_i = N P_i$, where $\sum_{i=1}^{M} P_i = 1$ guarantees that the
normalisation condition $\sum_{i=1}^{M} n_i = N$ is satisfied. The logarithm of
the number of different likely P-messages is given by (using Stirling's
approximation $\log x! \approx x \log x - x$ when x is large)

$$\log \left( \frac{N!}{n_1!\, n_2! \cdots n_M!} \right) \approx -N \sum_{i=1}^{M} P_i \log P_i \tag{2.2}$$


Now define the entropy H(P) of source P as the logarithm of the number of
different likely P-messages (measured per message symbol):

$$H(P) = -\sum_{i=1}^{M} P_i \log P_i \ge 0 \tag{2.3}$$


Thus H (P) is the number of bits per symbol (on aver- 
age) required to encode the source (assuming a perfect 
encoder), because the only messages that the source has a 
finite probability of producing are the likely P-messages 
that are enumerated in equation 2.2. 
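The link between the counting argument of equation 2.2 and the entropy of equation 2.3 can be illustrated numerically. The sketch below assumes base-2 logarithms (so that the result is in bits) and assumes that each $N P_i$ is an integer; the function names are illustrative.

```python
import math


def entropy_bits(p):
    """H(P) = -sum_i P_i log2 P_i, skipping zero-probability symbols."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def log2_likely_messages_per_symbol(p, N):
    """(1/N) log2 of the multinomial count N! / prod_i (N P_i)!.

    Uses lgamma(k + 1) = log(k!) so that large N can be handled without overflow.
    """
    counts = [round(N * q) for q in p]
    log_count = math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_count / (N * math.log(2))


if __name__ == "__main__":
    die = [1.0 / 6.0] * 6
    print(entropy_bits(die))
    for N in (60, 6000, 600000):
        print(N, log2_likely_messages_per_symbol(die, N))
```

As N grows, the per-symbol logarithm of the number of likely messages approaches H(P) from below, because the terms neglected by Stirling's approximation vanish relative to N.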

It is usually very difficult to encode the source P 
using H(P) bits per symbol on average. This is be- 
cause although the boundary between the set of likely 
P-messages and the set of unlikely P-messages is sharply 
defined in principle, in practice it is very hard to model 
mathematically. If this boundary is not precisely defined, 
then it is impossible to compute the value of H (P) accu- 
rately. In order to ensure that all of the likely P-messages 
are accounted for, it is necessary for the mathematical 
model of the boundary to lie outside the true bound- 
ary, which thus overestimates the value of H(P). This 
demonstrates that H (P) is in fact a lower bound on the 
true number of bits per symbol that must be used to 
encode the source P. 


B. Source Coding 


The mathematical model (or, simply, the model) of the boundary between the set
of likely P-messages and the set of unlikely P-messages may be derived from
another vector of probabilities, denoted as Q, whose M elements model the
probability of each symbol drawn from an alphabet of M distinct symbols. If
Q = P then the boundary is modelled perfectly, and hence in principle the
lower bound H(P) on the number of bits per symbol may be attained, although
even this is usually difficult to realise constructively in practice. In
practical situations $Q \ne P$ is invariably the case, so the problem of coding
a source with an inaccurate model cannot be avoided.

Since the only P-messages that can occur are the likely P-messages (which all
occur with equal probability), the number of bits required when using Q to
encode P is (minus) the logarithm of the probability $\Pi_N(P, Q)$ that a
Q-message is one of the likely P-messages. $\log \Pi_N(P, Q)$ is given by

$$\log \Pi_N(P, Q) = \log \left( \frac{N!}{n_1!\, n_2! \cdots n_M!}\, Q_1^{n_1} Q_2^{n_2} \cdots Q_M^{n_M} \right) \approx -N \sum_{i=1}^{M} P_i \log \left( \frac{P_i}{Q_i} \right) \le 0 \tag{2.4}$$

which is negative because the model Q generates likely P-messages with less
than unit probability. The model Q must be used to generate enough Q-messages
to ensure that all of the likely P-messages are reproduced, which requires the
basic H(P) bits per symbol (that would be required if Q = P), plus some extra
bits to compensate for the less than 100% efficiency with which Q generates
likely P-messages (because $Q \ne P$). The number of extra bits per symbol is
the relative entropy G(P, Q)

$$G(P, Q) = \sum_{i=1}^{M} P_i \log \left( \frac{P_i}{Q_i} \right) \ge 0 \tag{2.5}$$

which is $-\frac{1}{N} \log \Pi_N(P, Q)$, or minus the logarithm of the
probability per symbol that a Q-message is a likely P-message. Thus Q is used
to generate exactly the number of extra Q-messages required to compensate for
the fact that the probability that each Q-message is a likely P-message is
less than unity (i.e. $\log \Pi_N(P, Q) < 0$). G(P, Q) (i.e. relative entropy)
is the amount by which the number of bits per symbol exceeds the lower bound
H(P) (i.e. source entropy). For completeness, also define the total number of
bits per symbol H(P) + G(P, Q) as L(P, Q), which is given by

$$L(P, Q) = H(P) + G(P, Q) = -\sum_{i=1}^{M} P_i \log Q_i \ge 0 \tag{2.6}$$


The expression for G(P, Q) provides a means of optimising the model Q. If the
optimisation criterion is that the average number of bits per symbol required
when using Q to encode P should be minimised, then the optimum model
$Q_{opt}$ should minimise the objective function G(P, Q) with respect to Q,
thus

$$Q_{opt} = \underset{Q}{\arg\min}\; G(P, Q) \tag{2.7}$$

This criterion for optimising a model does not include the number of bits
required to specify the model itself, such as is used in the minimum
description length approach [11], although the objective function could be
extended to include such additional contributions.

G(P, Q) is frequently used as an objective function in density modelling,
where the source P is the vector of observed symbol frequencies. Since
$Q_{opt}$ must, in some sense, be close to P, this affords a practical way of
ensuring that the optimum model probabilities $Q_{opt}$ are similar to the
source symbol frequencies P, which is the goal of density modelling.
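The decomposition L(P, Q) = H(P) + G(P, Q) of equation 2.6, and the fact that G(P, Q) is minimised (at zero) by Q = P, can be seen in a few lines. This is a minimal sketch assuming base-2 logarithms and strictly positive model probabilities; the example vectors are arbitrary illustrative values.

```python
import math


def entropy_bits(p):
    """H(P) of equation 2.3, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def relative_entropy_bits(p, q):
    """G(P, Q) of equation 2.5: extra bits per symbol caused by using Q instead of P."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_bits(p, q):
    """L(P, Q) of equation 2.6: total bits per symbol when Q is used to encode P."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    P = [0.5, 0.25, 0.25]          # source symbol frequencies (illustrative)
    Q = [1.0 / 3.0] * 3            # a deliberately imperfect model
    print(entropy_bits(P), relative_entropy_bits(P, Q), total_bits(P, Q))
    print(relative_entropy_bits(P, P))
```

The three numbers on the first line satisfy the sum rule of equation 2.6, and the second line shows that the penalty vanishes when the model matches the source.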


C. Markov Source Coding 


The above scheme for using a model Q to encode sym- 
bols derived from a source P may be extended to the 
case where the source and the model are L-layer first 
order Markov chains. The word “layer” is used in antici- 
pation of the connection with multilayer neural networks 
that will be discussed in section III. Thus split up both 
of P and Q into their constituent transition probabilities 


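$$\begin{aligned} P &= \left( P^0, P^{1|0}, \cdots, P^{L-1|L-2}, P^{L|L-1} \right) \\ &= \left( P^{0|1}, P^{1|2}, \cdots, P^{L-1|L}, P^L \right) \\ Q &= \left( Q^0, Q^{1|0}, \cdots, Q^{L-1|L-2}, Q^{L|L-1} \right) \\ &= \left( Q^{0|1}, Q^{1|2}, \cdots, Q^{L-1|L}, Q^L \right) \end{aligned} \tag{2.8}$$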


These two ways of decomposing P (and Q) are equivalent, because a forward pass
through a Markov chain may be converted into a backward pass through a
different Markov chain, whose transition probabilities are uniquely determined
by applying Bayes' theorem to the original Markov chain. $P^{k|l}$ (and
$Q^{k|l}$) is the matrix of transition probabilities from layer l to layer k
of the Markov chain of the source (and model), $P^0$ (and $Q^0$) is the vector
of marginal probabilities in layer 0, and $P^L$ (and $Q^L$) is the vector of
marginal probabilities in layer L. This may be written out in detail as

$$\begin{aligned} P^0_{i_0} &= \text{true probability that layer 0 has state } i_0 \\ P^L_{i_L} &= \text{true probability that layer } L \text{ has state } i_L \\ P^{l+1|l}_{i_{l+1}, i_l} &= \text{true probability that layer } l+1 \text{ has state } i_{l+1} \text{ given that layer } l \text{ has state } i_l \\ P^{l|l+1}_{i_l, i_{l+1}} &= \text{true probability that layer } l \text{ has state } i_l \text{ given that layer } l+1 \text{ has state } i_{l+1} \end{aligned} \tag{2.9}$$

$$\begin{aligned} Q^0_{i_0} &= \text{model probability that layer 0 has state } i_0 \\ Q^L_{i_L} &= \text{model probability that layer } L \text{ has state } i_L \\ Q^{l+1|l}_{i_{l+1}, i_l} &= \text{model probability that layer } l+1 \text{ has state } i_{l+1} \text{ given that layer } l \text{ has state } i_l \\ Q^{l|l+1}_{i_l, i_{l+1}} &= \text{model probability that layer } l \text{ has state } i_l \text{ given that layer } l+1 \text{ has state } i_{l+1} \end{aligned} \tag{2.10}$$

The number of extra bits per symbol G(P, Q) (see equation 2.5) required to
encode each symbol from the source P using the model Q may then be written as

$$\begin{aligned} G(P, Q) &= \sum_{i_0=1}^{M_0} \sum_{i_1=1}^{M_1} \cdots \sum_{i_L=1}^{M_L} P^{0|1}_{i_0, i_1} P^{1|2}_{i_1, i_2} \cdots P^{L-1|L}_{i_{L-1}, i_L} P^L_{i_L} \log \left( \frac{P^{0|1}_{i_0, i_1} P^{1|2}_{i_1, i_2} \cdots P^{L-1|L}_{i_{L-1}, i_L} P^L_{i_L}}{Q^{0|1}_{i_0, i_1} Q^{1|2}_{i_1, i_2} \cdots Q^{L-1|L}_{i_{L-1}, i_L} Q^L_{i_L}} \right) \\ &= \sum_{l=0}^{L-1} \sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1}_{i_{l+1}}\, G_{i_{l+1}}\left( P^{l|l+1}, Q^{l|l+1} \right) + G\left( P^L, Q^L \right) \end{aligned} \tag{2.11}$$

where the flow of influence in both P and Q is from layer 0 to layer L. The
suffix $i_{l+1}$ that appears on the
$G_{i_{l+1}}\left( P^{l|l+1}, Q^{l|l+1} \right)$ indicates that the state of
layer l + 1 is fixed during the evaluation of
$G_{i_{l+1}}\left( P^{l|l+1}, Q^{l|l+1} \right)$ (i.e. it is the relative
entropy of layer l, given that the state of layer l + 1 is known). Similarly,
the total number of bits per symbol required to encode each symbol from the
source P using the model Q is L(P, Q) (i.e. H(P) + G(P, Q)), which is given by

$$L(P, Q) = \sum_{l=0}^{L-1} \sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1}_{i_{l+1}}\, L_{i_{l+1}}\left( P^{l|l+1}, Q^{l|l+1} \right) + L\left( P^L, Q^L \right) \tag{2.12}$$

This result has a very natural interpretation. Both the source P and the
model Q are Markov chains, and corresponding parts of the model are matched up
with corresponding parts of the source. First of all, the number of bits
required to encode the $L^{th}$ layer of the source is
$L\left( P^L, Q^L \right)$. Having done that, the number of bits required to
encode the $(L-1)^{th}$ layer of the source, given that the state of the
$L^{th}$ layer is already known, is
$L_{i_L}\left( P^{L-1|L}, Q^{L-1|L} \right)$, which must then be averaged over
the alternative possible states of the $L^{th}$ layer to yield
$\sum_{i_L=1}^{M_L} P^L_{i_L}\, L_{i_L}\left( P^{L-1|L}, Q^{L-1|L} \right)$.
This process is then repeated to encode the $(L-2)^{th}$ layer of the source,
given that the state of the $(L-1)^{th}$ layer is already known, and so on
back to layer 0. This yields precisely the expression for L(P, Q) given above.
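The layer-by-layer decomposition of equation 2.12 can be checked on the smallest possible case. The sketch below assumes a chain with L = 1 (two layers) and two states per layer, with base-2 logarithms; all of the probability tables are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical source: marginal of layer 0 and forward transition to layer 1.
P0 = [0.6, 0.4]                    # P^0_{i0}
P10 = [[0.7, 0.2], [0.3, 0.8]]     # P^{1|0}_{i1,i0}, indexed [i1][i0]

# Hypothetical model: marginal of layer 1 and backward transition to layer 0.
Q1 = [0.5, 0.5]                    # Q^1_{i1}
Q01 = [[0.8, 0.3], [0.2, 0.7]]     # Q^{0|1}_{i0,i1}, indexed [i0][i1]

# Joint source probability, then the backward form (P^{0|1}, P^1) via Bayes' theorem.
joint = [[P0[i0] * P10[i1][i0] for i1 in range(2)] for i0 in range(2)]
P1 = [sum(joint[i0][i1] for i0 in range(2)) for i1 in range(2)]
P01 = [[joint[i0][i1] / P1[i1] for i1 in range(2)] for i0 in range(2)]


def joint_bits():
    """L(P, Q) computed directly from the joint states of both layers."""
    return -sum(joint[i0][i1] * math.log2(Q01[i0][i1] * Q1[i1])
                for i0 in range(2) for i1 in range(2))


def layered_bits():
    """L(P, Q) computed layer by layer, as on the right-hand side of equation 2.12."""
    top = -sum(P1[i1] * math.log2(Q1[i1]) for i1 in range(2))
    below = sum(
        P1[i1] * -sum(P01[i0][i1] * math.log2(Q01[i0][i1]) for i0 in range(2))
        for i1 in range(2)
    )
    return below + top


if __name__ == "__main__":
    print(joint_bits(), layered_bits())
```

The two printed values coincide, which is the content of equation 2.12 for this chain: first encode layer 1, then encode layer 0 given the state of layer 1.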

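Bayes' theorem (in the form
$P^{l|l+1}_{i_l, i_{l+1}} P^{l+1}_{i_{l+1}} = P^l_{i_l} P^{l+1|l}_{i_{l+1}, i_l}$)
may be used to rewrite the expression for L(P, Q) so that the flow of
influence in P and Q runs in opposite directions. Thus

$$L(P, Q) = \sum_{l=0}^{L-1} \sum_{i_l=1}^{M_l} P^l_{i_l}\, K_{i_l}\left( P^{l+1|l}, Q^{l|l+1} \right) + L\left( P^L, Q^L \right) \tag{2.13}$$

where $K_{i_l}\left( P^{l+1|l}, Q^{l|l+1} \right)$ is defined as

$$K_{i_l}\left( P^{l+1|l}, Q^{l|l+1} \right) = -\sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1|l}_{i_{l+1}, i_l} \log Q^{l|l+1}_{i_l, i_{l+1}} \tag{2.14}$$

The expression for L(P, Q) in equation 2.13 has an analogous interpretation to
that in equation 2.12.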

Other types of Markov chain may also be considered, 
such as ones in which some of the layers are not included 
in the calculation of the number of bits required to encode 
the source. One such example is discussed in section 
IID. 


D. Alternative Viewpoints 


The relationship between conventional density models and Markov density models
can be stated from the point of view of a conventional density modeller. The
goal is to build a density model $Q^0$ of the source $P^0$, such that the
number of bits per symbol $L\left( P^0, Q^0 \right)$ required to encode $P^0$
is minimised. However, if the source $P^0$ is transformed through L layers of
a network to produce a transformed source $P^L$, then
$L\left( P^0, Q^0 \right) \le L(P, Q)$ where L(P, Q) is given in equation
2.12, which is the sum of the number of bits per symbol
$L\left( P^L, Q^L \right)$ required to encode $P^L$, plus (for
$l = 0, 1, \cdots, L-1$) the number of bits per symbol
$\sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1}_{i_{l+1}}\, L_{i_{l+1}}\left( P^{l|l+1}, Q^{l|l+1} \right)$
required to encode $P^{l|l+1}$.

Thus the problem of encoding the source $P^0$ can be split into three steps:
transform the source from $P^0$ to $P^L$, encode the transformed source
$P^L$, and encode all of the transformations $P^{l|l+1}$ (for
$l = 0, 1, \cdots, L-1$) to allow the original source to be reconstructed from
the transformed source. The total number of bits L(P, Q) required to encode
$P^L$ and $P^{l|l+1}$ (for $l = 0, 1, \cdots, L-1$) is then an upper bound on
the total number of bits $L\left( P^0, Q^0 \right)$ required to encode $P^0$.
In this picture, a Markov chain is used to connect the original source $P^0$
to the transformed source $P^L$, so the Markov chain relates one conventional
density modelling problem (i.e. optimising $Q^0$) to another (i.e. optimising
$Q^L$).
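The upper bound $L(P^0, Q^0) \le L(P, Q)$ can be illustrated on a two-layer chain. The sketch below assumes that $Q^0$ is the layer 0 marginal of the Markov model, i.e. $Q^0_{i_0} = \sum_{i_1} Q^{0|1}_{i_0, i_1} Q^1_{i_1}$, and uses base-2 logarithms; this reading of $Q^0$ and all of the numbers are assumptions made for illustration.

```python
import math

# Hypothetical two-state source and model, as probability tables.
P0 = [0.6, 0.4]                    # P^0_{i0}
P10 = [[0.7, 0.2], [0.3, 0.8]]     # P^{1|0}_{i1,i0}, indexed [i1][i0]
Q1 = [0.5, 0.5]                    # Q^1_{i1}
Q01 = [[0.8, 0.3], [0.2, 0.7]]     # Q^{0|1}_{i0,i1}, indexed [i0][i1]


def markov_bits():
    """L(P, Q): bits per symbol for the joint state of layers 0 and 1."""
    return -sum(P0[i0] * P10[i1][i0] * math.log2(Q01[i0][i1] * Q1[i1])
                for i0 in range(2) for i1 in range(2))


def conventional_bits():
    """L(P^0, Q^0): bits per symbol for layer 0 alone, using the model's marginal."""
    Q0 = [sum(Q01[i0][i1] * Q1[i1] for i1 in range(2)) for i0 in range(2)]
    return -sum(P0[i0] * math.log2(Q0[i0]) for i0 in range(2))


if __name__ == "__main__":
    print(conventional_bits(), markov_bits())
```

The gap between the two values is the average number of bits spent on describing the state of layer 1 given layer 0, which is why it can never be negative.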


The above description of the relationship between conventional density models
and Markov density models was presented from the point of view of a
conventional density modeller, who asserts that the goal is to build an
optimum (i.e. minimum number of bits per symbol) density model $Q^0$ of the
source $P^0$. From this point of view, the Markov chain is merely a means of
transforming the problem from modelling the source $P^0$ to modelling the
transformed source $P^L$. That this transformation process is imperfect is
reflected in the fact that more bits per symbol are required to encode the
transformed source $P^L$ (plus the state of the Markov chain that generates
it) than the original source $P^0$. A conventional density modeller might
reasonably ask what is the point of using Markov density models, if they give
only an upper bound on the number of bits per symbol for encoding the original
source $P^0$?


However, it is not at all clear that the conventional density modeller is
using the correct objective function in the first place. Why should the number
of bits per symbol for encoding the original source $P^0$ be especially
important? It is as if the world has been separated into an external world
(i.e. $P^0$) and an internal world (i.e. the $P^{l+1|l}$ for
$l = 0, 1, \cdots, L-1$), and a special status is accorded to the external
world, which deems that it is important to model its density $P^0$
accurately, at the expense of modelling the $P^{l+1|l}$ accurately. In the
Markov density modelling approach, this artificial boundary between external
and internal worlds is removed, because the Markov chain models the joint
density $\left( P^0, P^{1|0}, \cdots, P^{L|L-1} \right)$, where $P^0$ and the
$P^{l+1|l}$ are all accorded equal status. This even-handed approach is much
more natural than one in which a particular part of the source (i.e. the
external source) is accorded a special status.


In the language of multilayer neural networks, the vector
$\left( P^0, P^{1|0}, \cdots, P^{L|L-1} \right)$ is the source which comprises
the bottom-up transformations (or recognition models) which generate the
states of the internal layers of the network, and the vector
$\left( Q^{0|1}, Q^{1|2}, \cdots, Q^L \right)$ is the model of the source which
comprises the top-down transformations (or generative models). Thus the
network is self-referential, because it forms a model of a source that
includes its own internal states. This self-referential behaviour is present
in both the conventional density modelling and the Markov density modelling
approaches, but it is optimised in an even-handed fashion only in the latter.
III. APPLICATION TO UNSUPERVISED NEURAL NETWORKS


In section IIIA the theory of Markov source coding (that was presented in
section IIC) is applied to a multilayer neural network. In section IIIB this
approach is applied to a 2-layer neural network to obtain a soft vector
quantiser (VQ), which is generalised to a multilayer neural network in section
IIIC to obtain a network of coupled soft VQs. In section IIID it is shown how
an approximation to Kohonen's topographic mapping network can be derived from
the theory of Markov source coding. Finally, some additional results are
briefly discussed in section IIIE.


A. Source Model of a Layered Network 


In this section the optimisation of the joint PDF of the states of all of the
layers of an (L + 1)-layer encoder of the type that was discussed in section
IIC will be considered. It turns out that this leads to new insights into the
optimisation of multilayer unsupervised neural networks.

The Markov chain source
$P = (P^0, P^{1|0}, \ldots, P^{L-1|L-2}, P^{L|L-1})$ (or, equivalently,
$P = (P^{0|1}, P^{1|2}, \ldots, P^{L-1|L}, P^L)$) may be used to
describe the true behaviour (i.e. not merely a model) of
a layered neural network as follows. $P^0$ is an external
source, and $(P^{1|0}, \ldots, P^{L-1|L-2}, P^{L|L-1})$ is an internal
source, where external/internal describes whether the
source is outside/inside the layered network, respectively.
$P^{l+1|l}$ is not part of the source itself (i.e. the external
source), rather it is a transition matrix that describes
the way in which the state of layer $l$ of the neural
network influences the state of layer $l+1$. There is an
analogous interpretation of $P^L$ and the $P^{l|l+1}$.

The Markov chain model
$Q = (Q^0, Q^{1|0}, \ldots, Q^{L-1|L-2}, Q^{L|L-1})$ (or, equivalently,
$Q = (Q^{0|1}, Q^{1|2}, \ldots, Q^{L-1|L}, Q^L)$) may then be used as
a model (i.e. not actually the true behaviour) of a layered
neural network. $Q$ has an analogous interpretation
to $P$, except that it is a model of the source, rather than
the true behaviour of the source.

It turns out to be useful for the true Markov behaviour
(i.e. $P$) and the model Markov behaviour (i.e. $Q$) to
run in opposite directions through the Markov chain.
Thus $P = (P^0, P^{1|0}, \ldots, P^{L-1|L-2}, P^{L|L-1})$ (flow of influence
from layer 0 to layer $L$ of the Markov chain)
and $Q = (Q^{0|1}, Q^{1|2}, \ldots, Q^{L-1|L}, Q^L)$ (flow of influence
from layer $L$ to layer 0 of the Markov chain). In the conventional
language of neural networks, $P$ is a “recognition
model” and $Q$ is a “generative model”. Note that the
use of the word “model” in the terminology “recognition
model” is strictly speaking not accurate in this context,
because $P$ is a source, not a model. However, terminology
depends on one’s viewpoint. In Markov chain density
modelling $P$ is a source when viewed from the point of
view of the model $Q$, whereas in conventional density
modelling $P^0$ is a source when viewed from the point of
view of the model $Q^0$, in which case
$(P^{1|0}, \ldots, P^{L-1|L-2}, P^{L|L-1})$
is a recognition model and $(Q^{0|1}, Q^{1|2}, \ldots, Q^{L-1|L}, Q^L)$
is a generative model.


B. 2-Layer Soft Vector Quantiser (VQ) Network 


The expression for $L(P,Q)$ in equation 2.13 has a simple
internal structure which allows it to be systematically
analysed. Thus apply equation 2.13 to a 2-layer network
where

$$P = (P^0, P^{1|0}), \qquad Q = (Q^{0|1}, Q^1) \tag{3.1}$$

to obtain the objective function

$$L(P,Q) = -\sum_{i_0=1}^{M_0} \sum_{i_1=1}^{M_1} P^0_{i_0}\, P^{1|0}_{i_1,i_0} \log Q^{0|1}_{i_0,i_1} + L(P^1,Q^1) \tag{3.2}$$

Now change notation in order to make contact with previous
results on vector quantisers (VQ)

$$\begin{array}{rcll}
i_0 & \to & \mathbf{x} & \text{input vector} \\
i_1 & \to & y & \text{output code index} \\
\sum_{i_0=1}^{M_0} & \to & \int d\mathbf{x} & \\
\sum_{i_1=1}^{M_1} & \to & \sum_{y=1}^{M} & \\
P^0_{i_0} & \to & \Pr(\mathbf{x}) & \text{input PDF} \\
P^{1|0}_{i_1,i_0} & \to & \Pr(y|\mathbf{x}) & \text{recognition model} \\
Q^{0|1}_{i_0,i_1} & \to & \dfrac{V}{(\sqrt{2\pi}\,\sigma)^{\dim\mathbf{x}}} \exp\left(-\dfrac{\|\mathbf{x}-\mathbf{x}'(y)\|^2}{2\sigma^2}\right) & \text{Gaussian generative model} \\
Q^1_{i_1} & \to & Q(y) & \text{output prior}
\end{array} \tag{3.3}$$

where $\mathbf{x}$ is a continuous-valued input vector (e.g. the activity
pattern in layer 0), $\sigma^2$ is the (isotropic) variance of
the Gaussian generative model, $V$ is an infinitesimal volume
element in input space which may be used to convert
the Gaussian probability density into a probability, and $y$
is a discrete-valued output index (e.g. the location of the
next neuron to fire in layer 1). Note that the parameter
$V$ must be introduced in order to regularise the number
of bits required to specify each source state. In effect, $V$
specifies a resolution scale, such that details on smaller
scales are ignored.
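
The role of the resolution parameter $V$ can be illustrated with a minimal sketch, which assumes that code lengths are measured in bits (base-2 logarithms) and that the Gaussian generative model has the form given in equation 3.3. The function name `code_length_bits` and the numerical values are hypothetical and are chosen only for illustration.

```python
import math

def code_length_bits(x, x_recon, sigma, volume):
    """Bits needed to specify x at resolution `volume`, i.e. -log2 of
    V / (sqrt(2*pi)*sigma)^dim * exp(-||x - x_recon||^2 / (2*sigma^2))."""
    dim = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    log_q = (math.log(volume)
             - dim * math.log(math.sqrt(2.0 * math.pi) * sigma)
             - sq_err / (2.0 * sigma ** 2))
    return -log_q / math.log(2.0)

# Hypothetical input, reconstruction, and two resolution scales.
coarse = code_length_bits((0.3, -0.2), (0.0, 0.0), sigma=1.0, volume=1e-2)
fine = code_length_bits((0.3, -0.2), (0.0, 0.0), sigma=1.0, volume=1e-4)
extra_bits = fine - coarse  # cost of resolving 100 times finer detail
```

In this sketch, shrinking $V$ by a factor of 100 adds $\log_2 100$ bits to the code length whatever the values of $\mathbf{x}$ and $\sigma$, which is the sense in which $V$ sets the scale below which detail is ignored.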

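The notation defined in equation 3.3 allows $L(P,Q)$
to be written as

$$L(P,Q) = \frac{D_{VQ}}{4\sigma^2} + L(P^1,Q^1) - \log\left(\frac{V}{(\sqrt{2\pi}\,\sigma)^{\dim\mathbf{x}}}\right) \tag{3.4}$$

where $D_{VQ}$ is defined as

$$D_{VQ} \equiv 2\int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y=1}^{M} \Pr(y|\mathbf{x})\, \|\mathbf{x}-\mathbf{x}'(y)\|^2 \tag{3.5}$$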


The first term in equation 3.4 is proportional to the objective
function $D_{VQ}$ for a soft vector quantiser (VQ),
where $\Pr(y|\mathbf{x})$ is a soft encoder, $\mathbf{x}'(y)$ is the corresponding
reconstruction vector attached to code index
$y$, and $\|\mathbf{x}-\mathbf{x}'(y)\|^2$ is the squared $L^2$ norm of the reconstruction
error. A standard VQ [4] (i.e. winner-take-all encoder)
has $\Pr(y|\mathbf{x}) = \delta_{y,y(\mathbf{x})}$, which emerges as the optimal
form when this VQ objective function is minimised w.r.t.
$\Pr(y|\mathbf{x})$ (see [10] for a detailed discussion of these issues).
The second term in equation 3.4 (i.e. $L(P^1,Q^1)$) is the
cost of coding the output layer, and the third term is
constant.
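
The following sketch evaluates a sample estimate of the soft VQ objective in equation 3.5 for two encoders. It assumes a finite set of training vectors in place of the integral over $\Pr(\mathbf{x})$, and the softmax form of the soft encoder is a hypothetical choice made only for illustration; all function names and data are likewise illustrative.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def d_vq(xs, codebook, encoder):
    """Sample estimate of D_VQ = 2 E_x[ sum_y Pr(y|x) ||x - x'(y)||^2 ]."""
    total = 0.0
    for x in xs:
        probs = encoder(x, codebook)
        total += sum(p * sq_dist(x, c) for p, c in zip(probs, codebook))
    return 2.0 * total / len(xs)

def hard_encoder(x, codebook):
    """Winner-take-all encoder: all probability on the nearest code vector."""
    dists = [sq_dist(x, c) for c in codebook]
    best = dists.index(min(dists))
    return [1.0 if y == best else 0.0 for y in range(len(codebook))]

def soft_encoder(x, codebook, temperature=1.0):
    """Hypothetical soft encoder: softmax of negative squared distances."""
    dists = [sq_dist(x, c) for c in codebook]
    shift = min(dists)
    weights = [math.exp(-(d - shift) / temperature) for d in dists]
    norm = sum(weights)
    return [w / norm for w in weights]

# Hypothetical training vectors and reconstruction vectors x'(y).
xs = [(0.0, 0.1), (0.2, -0.1), (2.0, 2.1), (1.9, 1.8)]
codebook = [(0.1, 0.0), (2.0, 2.0)]
d_hard = d_vq(xs, codebook, hard_encoder)
d_soft = d_vq(xs, codebook, soft_encoder)
```

For a fixed set of reconstruction vectors the winner-take-all encoder can never give a larger value than a soft encoder, because it places all of the probability on the smallest squared distance.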

The effect of the $L(P^1,Q^1)$ term in $L(P,Q)$ (see
equation 3.4) is to encourage $P^1_{i_1} \to \delta_{i_1,i_1'}$ (only one state
in layer 1 is used) and $Q^1 \to P^1$ (perfect model in
layer 1). The behaviour $P^1_{i_1} \to \delta_{i_1,i_1'}$ is in conflict with
the requirements of the first term (i.e. the soft VQ) in
$L(P,Q)$, which requires that more than one state in layer
1 is used, in order to minimise the reconstruction distortion.
There is a tradeoff between increasing the number
of active states in layer 1 in order to enable the Gaussian
generative model ($Q^0$ is a Gaussian mixture distribution)
to make a good approximation to the external
source $P^0$, and decreasing the number of active states in
layer 1 in order to make the average total number of bits
$L(P^1,Q^1)$ required to specify an output state as small
as possible.


C. Coupled Soft VQ Networks 


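The results of section III B will now be generalised to
an $(L+1)$-layer network. The objective function for coding
a Markov source (equation 2.13) can be written, using
a notation which is analogous to that given in equation
3.3, as

$$L(P,Q) = \sum_{l=0}^{L-1} \frac{D^l_{VQ}}{4(\sigma_l)^2} + L(P^L,Q^L) - \sum_{l=0}^{L-1} \log\left(\frac{V_l}{(\sqrt{2\pi}\,\sigma_l)^{\dim\mathbf{x}_l}}\right) \tag{3.6}$$

where $D^l_{VQ}$ is defined as ($D^0_{VQ} = D_{VQ}$ as defined in
equation 3.5)

$$D^l_{VQ} \equiv 2\int d\mathbf{x}_l\, \Pr(\mathbf{x}_l) \sum_{y_{l+1}=1}^{M_{l+1}} \Pr(y_{l+1}|\mathbf{x}_l)\, \|\mathbf{x}_l - \mathbf{x}'_l(y_{l+1})\|^2 \tag{3.7}$$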
where $\mathbf{x}_l$ and $y_l$ are both used to denote the state of
layer $l$. The notation $\mathbf{x}_l$ is used to denote the input to
the encoder that connects layers $l$ and $l+1$, whereas
the notation $y_l$ denotes the output of the encoder that
connects layers $l-1$ and $l$. This redundancy of notation
is not actually necessary, but is used here to preserve the
distinction between input vectors and output codes.

The first term in equation 3.6 is a weighted sum (where
each term is weighted by $(\sigma_l)^{-2}$) of objective functions
for a set of soft VQs connecting each of the $L$ neighbouring
pairs of layers in the network. This type of network
structure will be called a VQ-ladder. The second term in
equation 3.6 (i.e. $L(P^L,Q^L)$) is the cost of coding the
output layer, and the third term is constant.

If the cost $L(P^L,Q^L)$ of coding the output layer is ignored,
then the multilayer Markov source coding objective
function $L(P,Q)$ is minimised by minimising the
objective function $\sum_{l=0}^{L-1} \frac{D^l_{VQ}}{4(\sigma_l)^2}$ for a VQ-ladder (see [10]
for a discussion of this point in the context of folded Markov
chains (FMC)). As the number $L$ of network layers is increased,
the $L(P^L,Q^L)$ term has less and
less effect on the overall optimisation, because it
is swamped by the VQ-ladder term.


D. Topographic Mapping Network 


The results obtained in section III B for a soft VQ may
be generalised to obtain a topographic mapping network
whose properties closely resemble those of a Kohonen network
[3]. This derivation is based on the approach to
topographic mappings that was presented in [6]. Thus
apply equation 2.13 to a 3-layer network

$$P = (P^0, P^{1|0}, P^{2|1}), \qquad Q = (Q^{0|1}, Q^{1|2}, Q^2) \tag{3.8}$$

$$P = (P^0, P^{2|0}), \qquad Q = (Q^{0|2}, Q^2) \tag{3.9}$$

where only layers 0 and 2 are included in the objective
function, to obtain

$$L(P,Q) = -\sum_{i_0=1}^{M_0} \sum_{i_2=1}^{M_2} P^0_{i_0}\, P^{2|0}_{i_2,i_0} \log Q^{0|2}_{i_0,i_2} + L(P^2,Q^2) \tag{3.10}$$

which should be compared with equation 3.2. An analogous
change of notation to that defined in equation 3.3
can be made

$$\begin{array}{rcll}
i_0 & \to & \mathbf{x} & \text{input vector} \\
i_1 & \to & y & \text{hidden code index} \\
i_2 & \to & z & \text{output code index} \\
\sum_{i_0=1}^{M_0} & \to & \int d\mathbf{x} & \\
P^0_{i_0} & \to & \Pr(\mathbf{x}) & \text{input PDF} \\
P^{1|0}_{i_1,i_0} & \to & \Pr(y|\mathbf{x}) & \text{recognition model (first stage)} \\
P^{2|1}_{i_2,i_1} & \to & \Pr(z|y) & \text{recognition model (second stage)} \\
Q^{0|2}_{i_0,i_2} & \to & \dfrac{V}{(\sqrt{2\pi}\,\sigma)^{\dim\mathbf{x}}} \exp\left(-\dfrac{\|\mathbf{x}-\mathbf{x}'(z)\|^2}{2\sigma^2}\right) & \text{Gaussian generative model} \\
Q^2_{i_2} & \to & Q(z) & \text{output prior}
\end{array} \tag{3.11}$$

to obtain

$$L(P,Q) = \frac{D_{VQ}}{4\sigma^2} + L(P^2,Q^2) - \log\left(\frac{V}{(\sqrt{2\pi}\,\sigma)^{\dim\mathbf{x}}}\right) \tag{3.12}$$

where $D_{VQ}$ is defined as

$$D_{VQ} \equiv 2\int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{z=1}^{M_2} \Pr(z|\mathbf{x})\, \|\mathbf{x}-\mathbf{x}'(z)\|^2 \tag{3.13}$$

which should be compared with the objective function in
equation 3.5.

This expression for $D_{VQ}$ explicitly involves the states
of layers 0 and 2 of a 3-layer network, and it will now be
manipulated into a form that explicitly involves the states
of layers 0 and 1. In order to simplify this calculation,
$D_{VQ}$ will be replaced by the equivalent objective function
[19]

$$D_{VQ} = \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{z=1}^{M_2} \Pr(z|\mathbf{x}) \int d\mathbf{x}'\, \Pr(\mathbf{x}'|z)\, \|\mathbf{x}-\mathbf{x}'\|^2 \tag{3.14}$$

Now introduce dummy integrations over the state of layer
1 to obtain

$$D_{VQ} = \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y=1}^{M_1} \Pr(y|\mathbf{x}) \sum_{z=1}^{M_2} \Pr(z|y) \sum_{y'=1}^{M_1} \Pr(y'|z) \int d\mathbf{x}'\, \Pr(\mathbf{x}'|y')\, \|\mathbf{x}-\mathbf{x}'\|^2 \tag{3.15}$$

and rearrange to obtain

$$D_{VQ} = \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y'=1}^{M_1} \Pr(y'|\mathbf{x}) \int d\mathbf{x}'\, \Pr(\mathbf{x}'|y')\, \|\mathbf{x}-\mathbf{x}'\|^2 \tag{3.16}$$

where

$$\Pr(y'|y) = \sum_{z=1}^{M_2} \Pr(y'|z)\,\Pr(z|y), \qquad \Pr(y'|\mathbf{x}) = \sum_{y=1}^{M_1} \Pr(y'|y)\,\Pr(y|\mathbf{x}) \tag{3.17}$$

which may be replaced by the equivalent objective function

$$D_{VQ} = 2\int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y'=1}^{M_1} \Pr(y'|\mathbf{x})\, \|\mathbf{x}-\mathbf{x}'(y')\|^2 \tag{3.18}$$

which should be compared with the objective function in
equation 3.13.

The overall effect of manipulating equation 3.13 into
the form given in equation 3.18 is to convert the objective
function from one that explicitly involves the states
of layers 0 and 2, to one that explicitly involves the states
of layers 0 and 1. This change is reflected in the replacement
of $\Pr(z|\mathbf{x})$ by $\Pr(y'|\mathbf{x})$. This new form for the objective
function (see equation 3.18) is exactly the same
as for a standard VQ (see equation 3.5), except that the
posterior probability $\Pr(y|\mathbf{x})$ is now processed through a
transition matrix $\Pr(y'|y)$ to produce $\Pr(y'|\mathbf{x})$. Because
$\Pr(y'|y) = \sum_{z=1}^{M_2} \Pr(y'|z)\,\Pr(z|y)$, it takes account of
the effect of the state $z$ of layer 2 on the training of layer
1, which is a type of self-supervision [7] in which higher
layers of a network coordinate the training of lower layers.
However, viewed from the point of view of layer 1, the
effect of the transition matrix $\Pr(y'|y)$ is to do damage
to the posterior probability by redistributing probability
amongst the states of layer 1. This process is thus called
probability leakage, and $\Pr(y'|y)$ is called a probability
leakage matrix.
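
The construction of the probability leakage matrix can be sketched for small discrete layers as follows. This is a minimal illustration which assumes that $\Pr(y'|z)$ is obtained from $\Pr(z|y)$ and a prior $\Pr(y)$ using Bayes' theorem; the function names, the toy transition matrix and the uniform prior are all hypothetical.

```python
def leakage_matrix(p_z_given_y, p_y):
    """Return leak[y][y2] = Pr(y2|y) = sum_z Pr(y2|z) Pr(z|y).

    p_z_given_y[y][z] is the layer 1 -> layer 2 transition matrix and
    p_y[y] is the prior over layer 1 states (every Pr(z) must be > 0).
    """
    n_y, n_z = len(p_y), len(p_z_given_y[0])
    # Marginal Pr(z) = sum_y Pr(z|y) Pr(y).
    p_z = [sum(p_z_given_y[y][z] * p_y[y] for y in range(n_y))
           for z in range(n_z)]
    # Bayes' theorem: Pr(y|z) = Pr(z|y) Pr(y) / Pr(z).
    p_y_given_z = [[p_z_given_y[y][z] * p_y[y] / p_z[z] for y in range(n_y)]
                   for z in range(n_z)]
    return [[sum(p_y_given_z[z][y2] * p_z_given_y[y][z] for z in range(n_z))
             for y2 in range(n_y)] for y in range(n_y)]

def leaked_posterior(p_y_given_x, leak):
    """Pr(y2|x) = sum_y Pr(y2|y) Pr(y|x), as in equation 3.17."""
    n_y = len(p_y_given_x)
    return [sum(leak[y][y2] * p_y_given_x[y] for y in range(n_y))
            for y2 in range(n_y)]

# Hypothetical example: 4 states in layer 1, where neighbouring pairs
# of layer 1 states map to the same one of 2 states in layer 2.
p_z_given_y = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
p_y = [0.25, 0.25, 0.25, 0.25]
leak = leakage_matrix(p_z_given_y, p_y)
leaked = leaked_posterior([1.0, 0.0, 0.0, 0.0], leak)
```

In this toy case a posterior concentrated on the first state of layer 1 leaks equally onto the two layer 1 states that share a parent in layer 2, and nowhere else, which is the localised behaviour expected of a topographic neighbourhood function.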

The objective function in equation 3.18 gives rise to a 
neural network that closely resembles a Kohonen topo- 
graphic mapping neural network [3], where Pr(y’|y) may 
be identified as the topographic neighbourhood function, 
as was shown in [6]. Note that in order for the topo- 
graphic neighbourhood to be localised (i.e. Pr(y’|y) > 0 
only for y’ in some local neighbourhood of y), the tran- 
sition matrix Pr(z|y) that generates the state of layer 2 
from the state of layer 1 must generate each z state from 
y states that are all close to each other. This connection 
with Kohonen topographic mapping neural networks is 
only approximate, because the training algorithm pro- 
posed by Kohonen does not correspond to the minimi- 
sation of any objective function. A generalised version 
of the Kohonen network which allows a factorial code to 
emerge may be derived using the results in section V [9]. 


E. Additional Results 


The objective function $\sum_{l=0}^{L-1} \frac{D^l_{VQ}}{4(\sigma_l)^2}$ for a VQ-ladder
couples the optimisation of the individual 2-layer VQs
together. Because the output of the $l$th VQ is the input
to the $(l+1)$th VQ (for $l = 0,1,2,\cdots,L-1$), the optimisation
of the $k$th VQ has side effects on the optimisation
of the $l$th VQs (for $l = k+1,k+2,\cdots,L-1$). This
leads to the effect called self-supervision, in which top-down
connections from higher to lower network layers are
automatically generated, to allow the lower layers to process
their input more effectively in the light of what the
higher layers discover in the data [7]. This is the multilayer
extension of the self-supervision effect that led to
topographic mappings in section III D.

The general expression for $L(P,Q)$ in equation 2.13
is the sum of two terms: the objective function
$\sum_{l=0}^{L-1} \sum_{i_l=1}^{M_l} P^l_{i_l} K_{i_l}(P^{l+1|l}, Q^{l|l+1})$ for a ladder (because
$Q$ is not necessarily Gaussian, the ladder is not necessarily
a VQ-ladder), plus the cost $L(P^L,Q^L)$ of encoding
layer $L$. The $L(P^L,Q^L)$ term has precisely the form
that is commonly used in density modelling, so any convenient
density model could be used to parameterise $Q^L$
in layer $L$. A typical implementation of the type of network
that minimises $L(P,Q)$ thus splits into two pieces
corresponding to the two different types of term in the
objective function. In the special case where $L = 0$ (i.e.
no ladder is used) this approach reduces to standard input
density modelling.


IV. HIERARCHICAL ENCODING USING AN
ADAPTIVE CLUSTER EXPANSION (ACE)


In this section the adaptive cluster expansion (ACE) 
network is discussed [5]. ACE is a tree-structured net- 
work, whose purpose is to decompose high-dimensional 
input vectors into a number of lower dimensional pieces. 
In section IV A the case of a deterministic source and a 
perfect model is considered, and in section IV B the case 
of a Gaussian model is discussed. 


A. ACE: Tree-Structured Density Network 


Consider the objective function $L(P,Q)$ for encoding
an $(L+1)$-layer Markov source (see equation 2.13), and
assume that the $Q^{l|l+1}$ part of the model is perfect so
that $Q^{l|l+1}_{i_l,i_{l+1}} = P^{l|l+1}_{i_l,i_{l+1}}$ (for $l = 0,1,\cdots,L-1$), and that
the $P^{l+1|l}$ part of the source is deterministic so that
$P^{l+1|l}_{i_{l+1},i_l} = \delta_{i_{l+1},i_{l+1}(i_l)}$ (for $l = 0,1,\cdots,L-1$), in which
case $L(P,Q)$ simplifies as follows (see appendix A 1)

$$L(P,Q) = H(P^0) - H(P^L) + L(P^L,Q^L) \tag{4.1}$$

where $H(P^0) - H(P^L)$ is the number of bits per symbol
required to convert a $P^L$-message into a $P^0$-message,
assuming that the $P^{l+1|l}$ part of the source is deterministic,
and that the model is perfect. This result is not
very interesting in itself.

However, if the $P^{l+1|l}$ part of the source is not only
deterministic, but is also tree-structured, and the model
is similarly tree-structured, then the notation must be
modified thus

$$\begin{aligned}
i_l &\to \mathbf{i}_l = (\mathbf{i}^1_l, \mathbf{i}^2_l, \cdots) \\
i_{l+1} &\to \mathbf{i}_{l+1} = (i^1_{l+1}, i^2_{l+1}, \cdots) \\
P^{l+1|l}_{i_{l+1},i_l} &\to P^{l+1|l}_{\mathbf{i}_{l+1},\mathbf{i}_l} = \delta_{i^1_{l+1},i^1_{l+1}(\mathbf{i}^1_l)}\, \delta_{i^2_{l+1},i^2_{l+1}(\mathbf{i}^2_l)} \cdots \\
Q^{l|l+1}_{i_l,i_{l+1}} &\to Q^{l|l+1}_{\mathbf{i}_l,\mathbf{i}_{l+1}} = P^{l|l+1}_{\mathbf{i}^1_l,i^1_{l+1}}\, P^{l|l+1}_{\mathbf{i}^2_l,i^2_{l+1}} \cdots
\end{aligned} \tag{4.2}$$

where the state $i_l$ of layer $l$ of the tree-structured Markov
source is more naturally written as a vector state $\mathbf{i}_l$ that
specifies the joint state of each branch of layer $l$ of the
tree (the $i_l$ style of notation is more suitable for a non-tree-structured
Markov source). Furthermore, the components
of the vector $\mathbf{i}_l$ are partitioned as $(\mathbf{i}^1_l, \mathbf{i}^2_l, \cdots)$,
where each $\mathbf{i}^c_l$ is the joint state of a subset $c$ of nodes in
layer $l$, where the nodes in each subset are all siblings
as seen from the point of view of layer $l+1$. Such a set of
siblings is called a cluster. The components of the vector
$\mathbf{i}_{l+1}$ are partitioned as $(i^1_{l+1}, i^2_{l+1}, \cdots)$, where $i^c_{l+1}$ is the
state of the parent (in layer $l+1$) of the siblings in cluster
$c$ in layer $l$.

This notation may be used to rearrange $L(P,Q)$ as
follows (see appendix A 2)

$$L(P,Q) = \sum_{l=0}^{L-1} \sum_{\text{cluster } c} H(P^l_{\mathbf{c}}) - \sum_{l=1}^{L} \sum_{\text{component } c} H(P^l_c) + L(P^L,Q^L) \tag{4.3}$$

This expression for $L(P,Q)$ can be rewritten in terms of the mutual information $I(P^l_{\mathbf{c}})$ between the components of
cluster $\mathbf{i}^c_l$ as (see appendix A 2)

$$L(P,Q) = -\sum_{l=1}^{L} \sum_{\text{cluster } c} I(P^l_{\mathbf{c}}) + \sum_{\text{cluster } c} H(P^0_{\mathbf{c}}) - \sum_{\text{cluster } c} H(P^L_{\mathbf{c}}) + L(P^L,Q^L) \tag{4.4}$$

Now assume that the model is perfect in the output
layer, so that $Q^L$ is given by $Q^L_{\mathbf{i}_L} = P^L_{\mathbf{i}^1_L} P^L_{\mathbf{i}^2_L} \cdots$. This
allows $L(P^L,Q^L)$ to be simplified as $L(P^L,Q^L) = \sum_{\text{cluster } c} H(P^L_{\mathbf{c}})$,
so that $L(P,Q)$ may finally be expressed as

$$L(P,Q) = -\sum_{l=1}^{L} \sum_{\text{cluster } c} I(P^l_{\mathbf{c}}) + \sum_{\text{cluster } c} H(P^0_{\mathbf{c}}) \tag{4.5}$$

The $-\sum_{l=1}^{L} \sum_{\text{cluster } c} I(P^l_{\mathbf{c}})$ term is (minus) the sum of
the mutual informations within all of the clusters in the
$(L+1)$-layer network, and the $\sum_{\text{cluster } c} H(P^0_{\mathbf{c}})$ term is
constant for a given external source $P^0$. This means
that minimising $L(P,Q)$ is equivalent to maximising
$\sum_{l=1}^{L} \sum_{\text{cluster } c} I(P^l_{\mathbf{c}})$. This is the maximum mutual information
result for ACE networks [8], which includes
the mutual information maximisation principle in [1] as
a special case.

Note that if the source is deterministic and the model
is perfect (as they are here), then $L(P^0,Q^0) = L(P,Q)$,
which implies that input density optimisation is equivalent
to joint density optimisation. This equivalence was
used in [8], where the sum-of-mutual-informations objective
function was derived by minimising $L(P^0,Q^0)$.


B. ACE: Hierarchical Vector Quantiser 


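If the above ACE network is modified slightly, so that
the model $Q$ has exactly the same structure as before, but
is Gaussian rather than perfect, then $Q^{l|l+1}_{\mathbf{i}_l,\mathbf{i}_{l+1}}$ becomes

$$Q^{l|l+1}_{\mathbf{i}_l,\mathbf{i}_{l+1}} = Q^{l|l+1}_{\mathbf{i}^1_l,i^1_{l+1}}\, Q^{l|l+1}_{\mathbf{i}^2_l,i^2_{l+1}} \cdots \tag{4.6}$$

where the individual $Q^{l|l+1}_{\mathbf{i}^c_l,i^c_{l+1}}$ are Gaussian. The expression
for $L(P,Q)$ may then be written down by analogy
with equation 3.6

$$L(P,Q) = \sum_{l=0}^{L-1} \sum_{\text{cluster } c} \frac{D^{l,c}_{VQ}}{4(\sigma_{l,c})^2} + L(P^L,Q^L) - \sum_{l=0}^{L-1} \sum_{\text{cluster } c} \log\left(\frac{V_{l,c}}{(\sqrt{2\pi}\,\sigma_{l,c})^{\dim\mathbf{x}_{l,c}}}\right) \tag{4.7}$$

Thus the ACE network, with a Gaussian model $Q$, is a
hierarchical VQ-ladder (or VQ-tree), in which each layer
encodes the clusters in the previous layer [6].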


V. FACTORIAL ENCODING USING A 
PARTITIONED MIXTURE DISTRIBUTION 
(PMD) 


In this section a useful parameterisation of the conditional
probability $P^{l+1|l}$ for building the Markov source
is introduced in order to encourage $P^{l+1|l}$ to form factorial
codes of the state of layer $l$. It turns out that there is
a simple way of allowing such codes to develop, which is
called the partitioned mixture distribution (PMD) [9]. A
PMD achieves this by encoding its input simultaneously
with a number of different recognition models, each of
which potentially can encode a different part of the input.

In section V A two ways in which multiple recognition
models can be used for factorial encoding are discussed,
and a hybrid approach (which is a PMD) is discussed in
section V B.


A. Multiple Recognition Models 


In the expression for $L(P,Q)$ (see equation 2.13)
the generative models $Q^{l|l+1}$ may be parameterised as
Gaussian probability densities, whereas the recognition
models $P^{l+1|l}$ may be parameterised in a more general
way as

$$P^{l+1|l}_{i_{l+1},i_l} = \frac{P^{l|l+1}_{i_l,i_{l+1}}\, P^{l+1}_{i_{l+1}}}{\sum_{i'_{l+1}=1}^{M_{l+1}} P^{l|l+1}_{i_l,i'_{l+1}}\, P^{l+1}_{i'_{l+1}}} \tag{5.1}$$

which guarantees the normalisation condition
$\sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1|l}_{i_{l+1},i_l} = 1$. A limitation of this type of
recognition model is that it allows only a single explanation
$i_{l+1}$ of the data $i_l$ (in the case of a hard $P^{l+1|l}$),
or a probability distribution over single explanations (in
the case of a soft $P^{l+1|l}$), so it cannot lead to a factorial
encoding of the data.

The simplest way of allowing a factorial encoding to
develop is to make simultaneous use of more than one recognition
model. Each recognition model uses its own $P^{l+1}$
vector and $P^{l|l+1}$ matrix to compute a posterior probability
of the type shown in equation 5.1, so that if
each recognition model is sensitised to a different part
of the input, then a factorial code can develop. This
approach can be formalised by making the replacement
$i_{l+1} \to \mathbf{i}_{l+1}$ in equation 5.1 (i.e. replace the scalar code
index by a vector code index, where the number of vector
components is equal to the number of recognition models).
If the components of $\mathbf{i}_{l+1}$ are determined independently
of each other, then their joint posterior probability
$P^{l+1|l}_{\mathbf{i}_{l+1},i_l}$ is a product of independent posterior probabilities,
where each posterior probability corresponds to one
of the recognition models, and thus has its own $P^{l+1}$
vector and $P^{l|l+1}$ matrix.

If this type of posterior probability, which is a product
of $n$ independent factors if there are $n$ independent
recognition models, is then inserted into equation 3.5 it
yields (see appendix B)

$$D_{VQ} \le 2\int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y_1=1}^{M_1} \sum_{y_2=1}^{M_2} \cdots \sum_{y_n=1}^{M_n} \Pr(y_1|\mathbf{x},1)\, \Pr(y_2|\mathbf{x},2) \cdots \Pr(y_n|\mathbf{x},n) \left\| \mathbf{x} - \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}'_k(y_k) \right\|^2 \tag{5.2}$$

If a single recognition model is independently used $n$ times, rather than $n$ independent recognition models each
independently being used once, then the above result becomes

$$D_{VQ} \le \frac{2}{n} \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y=1}^{M} \Pr(y|\mathbf{x})\, \|\mathbf{x}-\mathbf{x}'(y)\|^2 + \frac{2(n-1)}{n} \int d\mathbf{x}\, \Pr(\mathbf{x}) \left\| \mathbf{x} - \sum_{y=1}^{M} \Pr(y|\mathbf{x})\, \mathbf{x}'(y) \right\|^2 \tag{5.3}$$

In the case $n = 1$ this correctly reduces to equation 3.5
(the inequality reduces to an equality in this case). When
$n > 1$ the second term offers the possibility of factorial
encoding, because it contains a weighted linear combination
$\sum_{y=1}^{M} \Pr(y|\mathbf{x})\, \mathbf{x}'(y)$ of vectors.
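
The two-term structure of equation 5.3 can be checked numerically for a single input vector. The sketch below assumes that $n$ code indices are drawn independently from one posterior $\Pr(y|\mathbf{x})$ and that the reconstruction is the average of the corresponding code vectors; it then compares a brute-force enumeration of the expected squared error with the closed form $\frac{1}{n}\sum_y \Pr(y|\mathbf{x})\|\mathbf{x}-\mathbf{x}'(y)\|^2 + \frac{n-1}{n}\|\mathbf{x}-\sum_y \Pr(y|\mathbf{x})\mathbf{x}'(y)\|^2$. The overall factor of 2 and the integral over $\mathbf{x}$ are omitted, and all names and numerical values are hypothetical.

```python
import itertools

def sq_norm(v):
    """Squared Euclidean norm of a vector."""
    return sum(c * c for c in v)

def expected_error_enumerated(x, probs, codebook, n):
    """E|| x - (1/n) sum_k x'(y_k) ||^2, summing over all (y_1, ..., y_n)."""
    dim = len(x)
    total = 0.0
    for ys in itertools.product(range(len(probs)), repeat=n):
        weight = 1.0
        for y in ys:
            weight *= probs[y]
        recon = [sum(codebook[y][d] for y in ys) / n for d in range(dim)]
        total += weight * sq_norm([x[d] - recon[d] for d in range(dim)])
    return total

def expected_error_closed_form(x, probs, codebook, n):
    """(1/n) * soft VQ term + ((n-1)/n) * distance to the mean code vector."""
    dim = len(x)
    term1 = sum(p * sq_norm([x[d] - c[d] for d in range(dim)])
                for p, c in zip(probs, codebook))
    mean = [sum(p * c[d] for p, c in zip(probs, codebook)) for d in range(dim)]
    term2 = sq_norm([x[d] - mean[d] for d in range(dim)])
    return term1 / n + (n - 1) / n * term2

# Hypothetical input, posterior Pr(y|x) and code vectors x'(y).
x = (0.4, -0.3)
probs = [0.2, 0.5, 0.3]
codebook = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
brute = expected_error_enumerated(x, probs, codebook, n=3)
closed = expected_error_closed_form(x, probs, codebook, n=3)
```

The two values agree to rounding error in this sketch. As $n$ grows the first term is suppressed and the second term dominates, so the reconstruction is increasingly determined by the linear combination of code vectors.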


B. Average Over Recognition Models 


Now combine the above two approaches to factorial
encoding, so that a single recognition model is used (as
in equation 5.3), which is parameterised in such a way
that it can emulate multiple recognition models (as in
equation 5.2). The simplest possibility is to firstly make
the replacement $P^{l+1}_{i_{l+1}} \to A^{l+1}_{k,i_{l+1}} P^{l+1}_{i_{l+1}}$ (where $A^{l+1}_{k,i_{l+1}} \ge 0$)
in equation 5.1, where $k$ is a recognition model index
which ranges over $k = 1,2,\cdots,K$ (note that $K$ is not
constrained to be the same as $n$), and then secondly to
average over $k$, to produce

$$P^{l+1|l}_{i_{l+1},i_l} = \frac{1}{K} \sum_{k=1}^{K} \frac{P^{l|l+1}_{i_l,i_{l+1}}\, A^{l+1}_{k,i_{l+1}}\, P^{l+1}_{i_{l+1}}}{\sum_{i'_{l+1}=1}^{M_{l+1}} P^{l|l+1}_{i_l,i'_{l+1}}\, A^{l+1}_{k,i'_{l+1}}\, P^{l+1}_{i'_{l+1}}} \tag{5.4}$$

In effect, $K$ recognition models are embedded between
layer $l$ and layer $l+1$, and the $A^{l+1}$ matrix specifies which
indices $i_{l+1}$ in layer $l+1$ are associated with recognition
model $k$.
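
A minimal sketch of the PMD posterior probability in equation 5.4 is given below for one observed state of layer $l$. The function name `pmd_posterior`, the likelihood values and the arrangement of the 0/1 entries of the $A^{l+1}$ matrix are hypothetical choices for illustration.

```python
def pmd_posterior(likelihood, prior, A):
    """PMD posterior over the states i of layer l+1 (equation 5.4).

    likelihood[i] stands for P^{l|l+1} evaluated at the observed layer l
    state, prior[i] for P^{l+1}, and A[k][i] >= 0 selects which states
    belong to recognition model k (each model needs a non-zero normaliser).
    """
    n_models, n_states = len(A), len(prior)
    posterior = [0.0] * n_states
    for k in range(n_models):
        weights = [likelihood[i] * A[k][i] * prior[i] for i in range(n_states)]
        norm = sum(weights)  # single summation over states, for model k only
        for i in range(n_states):
            posterior[i] += weights[i] / norm / n_models
    return posterior

# Hypothetical example: 4 states in layer l+1, and K = 3 recognition
# models that cover overlapping patches of width 2.
A = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
likelihood = [0.1, 0.4, 0.3, 0.2]
prior = [0.25, 0.25, 0.25, 0.25]
posterior = pmd_posterior(likelihood, prior, A)
```

Each recognition model is normalised only over its own patch, and the average over $k$ then gives a posterior probability that sums to unity over layer $l+1$.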

The result in equation 5.4 is not the same as the result
that would have been obtained using a Bayesian analysis,
in which the posterior probabilities generated by
different models are combined to yield a single posterior
probability. In appendix B there is a discussion of the
relationship between the above proposed PMD recognition
model and a full Bayesian average over alternative
recognition models.

A partitioned mixture distribution (PMD) is precisely
this type of multiple embedded recognition model. In
the simplest type of PMD the $A^{l+1}$ matrix is chosen to
contain only 0’s and 1’s, which are arranged so that the
$K$ recognition models partition layer $l+1$ into $K$ overlapping
patches [9]. A wide range of types of PMD can
be constructed by choosing $A^{l+1}$ appropriately.

In section III D it was shown how a Kohonen topographic
mapping emerged when a 3-layer Markov source
network was optimised. If the PMD posterior probability
(see equation 5.4) had been used in section III D, then a
more general form of topographic mapping (i.e. a factorial
topographic mapping) would have emerged (this is
briefly discussed in [9]).


VI. CONCLUSIONS 


The objective function for optimising the density 
model of a Markov source may be applied to the problem 
of optimising the joint density of all the layers of a neu- 
ral network. This is possible because the joint state of all 
of the network layers may be viewed as a Markov chain 
of states (each layer is connected only to adjacent lay- 
ers). This representation makes contact with the results 
that were reported in [10], and allows many results to 
be unified into a single approach (i.e. a single objective 
function). 

The most significant aspect of this unification is the 
fact that all layers of a neural network are treated on an 
equal footing, unlike in the conventional approach to den- 
sity modelling where the input layer is accorded a special 
status. For instance, this leads to a modular approach to 
building neural networks, where all of the modules have 
the same structure.


Appendix A: ACE


In this appendix some of the more technical details relevant to section IV are given. 


1. Perfect Model, Deterministic Source 


The derivation of the result in equation 4.1 for a perfect model (i.e. $Q = P$) and a deterministic source (i.e.
$P^{l+1|l}_{i_{l+1},i_l} = \delta_{i_{l+1},i_{l+1}(i_l)}$, using scalar notation $i_l$ rather than vector notation $\mathbf{i}_l$ for the state of layer $l$, because here the
Markov source is not assumed to be tree-structured) is as follows. The basic definition of $L(P,Q)$ in equation 2.13
may be written as

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{i_l=1}^{M_l} P^l_{i_l} \sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1|l}_{i_{l+1},i_l} \log P^{l|l+1}_{i_l,i_{l+1}} \tag{A1}$$

This may be simplified by noting that $P^{l+1|l}_{i_{l+1},i_l} = \delta_{i_{l+1},i_{l+1}(i_l)}$, and that Bayes’ theorem gives $P^{l|l+1}_{i_l,i_{l+1}} = \frac{P^{l+1|l}_{i_{l+1},i_l} P^l_{i_l}}{P^{l+1}_{i_{l+1}}}$, which
yields

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{i_l=1}^{M_l} P^l_{i_l} \sum_{i_{l+1}=1}^{M_{l+1}} \delta_{i_{l+1},i_{l+1}(i_l)} \log \frac{\delta_{i_{l+1},i_{l+1}(i_l)}\, P^l_{i_l}}{P^{l+1}_{i_{l+1}}} \tag{A2}$$

Now use that $\sum_{i_{l+1}=1}^{M_{l+1}} \delta_{i_{l+1},i_{l+1}(i_l)} = 1$, $\sum_{i_{l+1}=1}^{M_{l+1}} \delta_{i_{l+1},i_{l+1}(i_l)} \log \delta_{i_{l+1},i_{l+1}(i_l)} = 0$ and $\sum_{i_l=1}^{M_l} P^l_{i_l}\, \delta_{i_{l+1},i_{l+1}(i_l)} = P^{l+1}_{i_{l+1}}$ to
reduce this to

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{i_l=1}^{M_l} P^l_{i_l} \log P^l_{i_l} + \sum_{l=0}^{L-1} \sum_{i_{l+1}=1}^{M_{l+1}} P^{l+1}_{i_{l+1}} \log P^{l+1}_{i_{l+1}} \tag{A3}$$

The terms in these two series mostly cancel each other to yield

$$L(P,Q) - L(P^L,Q^L) = -\sum_{i_0=1}^{M_0} P^0_{i_0} \log P^0_{i_0} + \sum_{i_L=1}^{M_L} P^L_{i_L} \log P^L_{i_L} \tag{A4}$$

and using the definition of entropy (see equation 2.3) this may finally be written as

$$L(P,Q) - L(P^L,Q^L) = H(P^0) - H(P^L) \tag{A5}$$
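
The telescoping argument that leads to equation A5 can be checked numerically on a small deterministic chain. The sketch below assumes base-2 logarithms, a perfect model ($Q = P$), and deterministic layer-to-layer mappings stored as lists; the function names and the example distribution are hypothetical.

```python
import math

def push_forward(p, mapping, n_out):
    """Distribution over layer l+1 induced by the deterministic map i_l -> i_{l+1}."""
    q = [0.0] * n_out
    for i, p_i in enumerate(p):
        q[mapping[i]] += p_i
    return q

def entropy(p):
    """Entropy in bits, skipping zero-probability states."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

def ladder_cost(p0, mappings, sizes):
    """Right-hand side of equation A1 for Q = P and a deterministic source.

    With P^{l+1|l} a delta function, Bayes' theorem gives
    P^{l|l+1} = P^l_{i_l} / P^{l+1}_{i_{l+1}(i_l)} on the support of the delta.
    Returns the cost and the distribution over the top layer.
    """
    cost, p = 0.0, p0
    for mapping, n_out in zip(mappings, sizes):
        q = push_forward(p, mapping, n_out)
        for i, p_i in enumerate(p):
            if p_i > 0:
                cost -= p_i * math.log2(p_i / q[mapping[i]])
        p = q
    return cost, p

# Hypothetical 3-layer chain: 4 states -> 3 states -> 2 states.
p0 = [0.1, 0.2, 0.3, 0.4]
mappings = [[0, 0, 1, 2], [0, 1, 1]]
sizes = [3, 2]
cost, p_top = ladder_cost(p0, mappings, sizes)
entropy_drop = entropy(p0) - entropy(p_top)
```

The layer-by-layer cost and the entropy difference $H(P^0) - H(P^L)$ agree to rounding error in this sketch, as expected from the cancellation of intermediate terms between equations A3 and A4.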


2. Perfect Model, Deterministic Source: Tree-Structured Case 


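The derivation of the result in equation 4.3 for a perfect tree-structured model and a deterministic tree-structured
source may be obtained by altering the notation in appendix A 1 to reflect the fact that both the Markov source
and model are now tree-structured. Thus use the notation defined in equation 4.2 to write $L(P,Q)$ (see equation
2.13) as

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{\mathbf{i}_l} \sum_{\mathbf{i}_{l+1}} P^{l+1|l}_{\mathbf{i}_{l+1},\mathbf{i}_l}\, P^l_{\mathbf{i}_l} \sum_{\text{cluster } c} \log P^{l|l+1}_{\mathbf{i}^c_l,i^c_{l+1}} \tag{A6}$$

Now use Bayes’ theorem in the form $P^{l|l+1}_{\mathbf{i}^c_l,i^c_{l+1}} = \frac{P^{l+1|l}_{i^c_{l+1},\mathbf{i}^c_l} P^l_{\mathbf{i}^c_l}}{P^{l+1}_{i^c_{l+1}}}$ to write this as

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{\mathbf{i}_l} \sum_{\mathbf{i}_{l+1}} P^{l+1|l}_{\mathbf{i}_{l+1},\mathbf{i}_l}\, P^l_{\mathbf{i}_l} \sum_{\text{cluster } c} \log \frac{P^{l+1|l}_{i^c_{l+1},\mathbf{i}^c_l}\, P^l_{\mathbf{i}^c_l}}{P^{l+1}_{i^c_{l+1}}} \tag{A7}$$

This may be simplified by using that $\sum_{\mathbf{i}_{l+1}} P^{l+1|l}_{\mathbf{i}_{l+1},\mathbf{i}_l} = 1$ (for the $\log P^l_{\mathbf{i}^c_l}$ term), $\sum_{\mathbf{i}_l} P^{l+1|l}_{\mathbf{i}_{l+1},\mathbf{i}_l} P^l_{\mathbf{i}_l} (\cdots) = P^{l+1}_{\mathbf{i}_{l+1}} (\cdots)$
(for the $\log P^{l+1}_{i^c_{l+1}}$ term), and $\sum_{i^c_{l+1}} \delta_{i^c_{l+1},i^c_{l+1}(\mathbf{i}^c_l)} \log \delta_{i^c_{l+1},i^c_{l+1}(\mathbf{i}^c_l)} = 0$ (for the $\log P^{l+1|l}_{i^c_{l+1},\mathbf{i}^c_l}$ term),
to yield

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{\mathbf{i}_l} P^l_{\mathbf{i}_l} \sum_{\text{cluster } c} \log P^l_{\mathbf{i}^c_l} + \sum_{l=0}^{L-1} \sum_{\mathbf{i}_{l+1}} P^{l+1}_{\mathbf{i}_{l+1}} \sum_{\text{cluster } c} \log P^{l+1}_{i^c_{l+1}} \tag{A8}$$

The first term may be simplified by interchanging the order of summation $\sum_{\mathbf{i}_l} \sum_{\text{cluster } c} (\cdots) = \sum_{\text{cluster } c} \sum_{\mathbf{i}_l} (\cdots)$, and then
marginalising the probabilities using that $\sum_{\text{cluster } c} \sum_{\mathbf{i}_l} P^l_{\mathbf{i}_l} \log P^l_{\mathbf{i}^c_l} = \sum_{\text{cluster } c} \sum_{\mathbf{i}^c_l} P^l_{\mathbf{i}^c_l} \log P^l_{\mathbf{i}^c_l}$. The second term may
be simplified by interchanging the order of summation $\sum_{\mathbf{i}_{l+1}} \sum_{\text{cluster } c} (\cdots) = \sum_{\text{cluster } c} \sum_{\mathbf{i}_{l+1}} (\cdots)$, then marginalising
the probabilities using that $\sum_{\text{cluster } c} \sum_{\mathbf{i}_{l+1}} P^{l+1}_{\mathbf{i}_{l+1}} \log P^{l+1}_{i^c_{l+1}} = \sum_{\text{cluster } c} \sum_{i^c_{l+1}} P^{l+1}_{i^c_{l+1}} \log P^{l+1}_{i^c_{l+1}}$, and then using that
component $c$ in layer $l+1$ is the parent of cluster $c$ in layer $l$, to obtain

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=0}^{L-1} \sum_{\text{cluster } c} \sum_{\mathbf{i}^c_l} P^l_{\mathbf{i}^c_l} \log P^l_{\mathbf{i}^c_l} + \sum_{l=1}^{L} \sum_{\text{component } c} \sum_{i^c_l} P^l_{i^c_l} \log P^l_{i^c_l} \tag{A9}$$

and using the definition of entropy (see equation 2.3) this may finally be written as

$$L(P,Q) - L(P^L,Q^L) = \sum_{l=0}^{L-1} \sum_{\text{cluster } c} H(P^l_{\mathbf{c}}) - \sum_{l=1}^{L} \sum_{\text{component } c} H(P^l_c) \tag{A10}$$

where $H(P^l_{\mathbf{c}})$ is the entropy of cluster $c$ and $H(P^l_c)$ is the entropy of component $c$ (both in layer $l$). The mutual
information $I(P^l_{\mathbf{c}})$ between the components $c'$ of cluster $c$ is defined as

$$I(P^l_{\mathbf{c}}) \equiv \sum_{\substack{\text{component } c' \\ \text{in cluster } c}} H(P^l_{c'}) - H(P^l_{\mathbf{c}}) \tag{A11}$$

and using that $\sum_{\text{cluster } c} \sum_{\text{component } c' \text{ in cluster } c} (\cdots) = \sum_{\text{component } c'} (\cdots)$ this yields

$$\sum_{\text{cluster } c} I(P^l_{\mathbf{c}}) = \sum_{\text{component } c} H(P^l_c) - \sum_{\text{cluster } c} H(P^l_{\mathbf{c}}) \tag{A12}$$

which allows $L(P,Q) - L(P^L,Q^L)$ to be simplified to

$$L(P,Q) - L(P^L,Q^L) = -\sum_{l=1}^{L} \sum_{\text{cluster } c} I(P^l_{\mathbf{c}}) + \sum_{\text{cluster } c} H(P^0_{\mathbf{c}}) - \sum_{\text{cluster } c} H(P^L_{\mathbf{c}}) \tag{A13}$$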
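
The mutual information within a single cluster (equation A11) can be computed directly from the joint distribution of the cluster's components. The sketch below assumes base-2 logarithms and represents the joint distribution as a dictionary keyed by tuples of component states; the function names and the two example clusters are hypothetical.

```python
import math

def entropy(dist):
    """Entropy in bits of a distribution stored as {state: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    """Marginal distribution of one component of the cluster."""
    out = {}
    for state, p in joint.items():
        out[state[axis]] = out.get(state[axis], 0.0) + p
    return out

def cluster_mutual_information(joint):
    """Sum of component entropies minus the joint (cluster) entropy."""
    n_components = len(next(iter(joint)))
    component_sum = sum(entropy(marginal(joint, a)) for a in range(n_components))
    return component_sum - entropy(joint)

# Hypothetical clusters, each with two binary components.
correlated = {(0, 0): 0.5, (1, 1): 0.5}
independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
mi_correlated = cluster_mutual_information(correlated)
mi_independent = cluster_mutual_information(independent)
```

A perfectly correlated pair of binary components carries one bit of mutual information, whereas an independent pair carries none, so only the former contributes to the sum that is maximised in equation 4.5.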


Appendix B: PMD 


In this appendix some of the more technical details relevant to section V are given. 


1. PMD Recognition Model 


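If the type of posterior probability introduced in section V A, which is a product of $n$ independent factors if there
are $n$ independent recognition models, is inserted into equation 3.5 it yields a $D_{VQ}$ of the form

$$D_{VQ} = 2\int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y_1=1}^{M_1} \sum_{y_2=1}^{M_2} \cdots \sum_{y_n=1}^{M_n} \Pr(y_1|\mathbf{x},1)\, \Pr(y_2|\mathbf{x},2) \cdots \Pr(y_n|\mathbf{x},n)\, \|\mathbf{x}-\mathbf{x}'(y_1,y_2,\cdots,y_n)\|^2 \tag{B1}$$

where $\Pr(y_k|\mathbf{x},k)$ denotes the posterior probability that (given input $\mathbf{x}$) code index $y_k$ occurs in recognition model $k$.
If $\mathbf{x}'(y_1,y_2,\cdots,y_n)$ is optimised (i.e. takes the value that minimises $D_{VQ}$) then it becomes

$$\mathbf{x}'(y_1,y_2,\cdots,y_n) = \frac{\int d\mathbf{x}\, \Pr(\mathbf{x})\, \Pr(y_1|\mathbf{x},1)\, \Pr(y_2|\mathbf{x},2) \cdots \Pr(y_n|\mathbf{x},n)\, \mathbf{x}}{\int d\mathbf{x}\, \Pr(\mathbf{x})\, \Pr(y_1|\mathbf{x},1)\, \Pr(y_2|\mathbf{x},2) \cdots \Pr(y_n|\mathbf{x},n)} \tag{B2}$$

The $\|\mathbf{x}-\mathbf{x}'(y_1,y_2,\cdots,y_n)\|^2$ term may be expanded thus (by adding and subtracting $\frac{1}{n}\sum_{k=1}^{n} \mathbf{x}'_k(y_k)$)

$$\|\mathbf{x}-\mathbf{x}'(y_1,y_2,\cdots,y_n)\|^2 = \left\| \left( \mathbf{x} - \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}'_k(y_k) \right) + \left( \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}'_k(y_k) - \mathbf{x}'(y_1,y_2,\cdots,y_n) \right) \right\|^2 \tag{B3}$$

Using these two results, together with Bayes’ theorem, allows an upper bound on $D_{VQ}$ to be derived as

$$D_{VQ} \le 2\int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y_1=1}^{M_1} \sum_{y_2=1}^{M_2} \cdots \sum_{y_n=1}^{M_n} \Pr(y_1|\mathbf{x},1)\, \Pr(y_2|\mathbf{x},2) \cdots \Pr(y_n|\mathbf{x},n) \left\| \mathbf{x} - \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}'_k(y_k) \right\|^2 \tag{B4}$$

In this upper bound, the $\Pr(y_k|\mathbf{x},k)$ are used to produce soft encodings in each of the recognition models ($k =
1,2,\cdots,n$), then a sum $\frac{1}{n}\sum_{k=1}^{n} \mathbf{x}'_k(y_k)$ of the vectors $\mathbf{x}'_k(y_k)$ is used as the reconstruction of the input $\mathbf{x}$. In the
special case where hard encodings are used, so that $\Pr(y_k|\mathbf{x},k) = \delta_{y_k,y_k(\mathbf{x})}$, the upper bound on $D_{VQ}$ reduces
to $D_{VQ} \le 2\int d\mathbf{x}\, \Pr(\mathbf{x}) \left\| \mathbf{x} - \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}'_k(y_k(\mathbf{x})) \right\|^2$.
Note that the code vectors used for the encoding operation $y_k(\mathbf{x})$
are not necessarily the same as the $\mathbf{x}'_k(y_k)$, except in the special case $n = 1$.

Suppose that a single recognition model is independently used $n$ times, rather than $n$ independent recognition
models each independently being used once. This corresponds to constraining the $P^{l+1}$ vectors and $P^{l|l+1}$ matrices
to be the same for each of the $n$ recognition models. The upper bound on $D_{VQ}$ can be manipulated into the form

$$D_{VQ} \le \frac{2}{n} \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y=1}^{M} \Pr(y|\mathbf{x})\, \|\mathbf{x}-\mathbf{x}'(y)\|^2 + \frac{2(n-1)}{n} \int d\mathbf{x}\, \Pr(\mathbf{x}) \left\| \mathbf{x} - \sum_{y=1}^{M} \Pr(y|\mathbf{x})\, \mathbf{x}'(y) \right\|^2 \tag{B5}$$

where the $k$ index is no longer needed.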


B. Full Bayesian Average Over Recognition Models 


One possible criticism of the recognition model given
in equation 5.4 is that it is a mixture of $K$ recognition
models, where each contributing model is assigned
the same weight $\frac{1}{K}$. Normally, a posterior probability
$\Pr(y|\mathbf{x})$ is decomposed as a sum over posterior probabilities
$\Pr(y|\mathbf{x},k)$ derived from each contributing model, as
follows

$$\Pr(y|\mathbf{x}) = \sum_{k=1}^{K} \Pr(y|\mathbf{x},k)\, \Pr(k|\mathbf{x}) \tag{2.6}$$

where each of the $K$ recognition models is assigned a different
data-dependent weight $\Pr(k|\mathbf{x})$. The conditional
probabilities $\Pr(k|\mathbf{x})$ and $\Pr(y|\mathbf{x},k)$ can be evaluated to
yield

$$\begin{aligned}
\Pr(k|\mathbf{x}) &= \frac{\sum_{y=1}^{M} \Pr(\mathbf{x}|y,k)\, \Pr(y|k)\, \Pr(k)}{\sum_{k'=1}^{K} \sum_{y'=1}^{M} \Pr(\mathbf{x}|y',k')\, \Pr(y'|k')\, \Pr(k')} \\
\Pr(y|\mathbf{x},k) &= \frac{\Pr(\mathbf{x}|y,k)\, \Pr(y|k)\, \Pr(k)}{\sum_{y'=1}^{M} \Pr(\mathbf{x}|y',k)\, \Pr(y'|k)\, \Pr(k)}
\end{aligned} \tag{2.7}$$

so that

$$\Pr(y|\mathbf{x}) = \sum_{k=1}^{K} \frac{\Pr(\mathbf{x}|y,k)\, \Pr(y|k)\, \Pr(k)}{\sum_{k'=1}^{K} \sum_{y'=1}^{M} \Pr(\mathbf{x}|y',k')\, \Pr(y'|k')\, \Pr(k')} \tag{2.8}$$

If the replacements $\Pr(k) \to \frac{1}{K}$, $\Pr(y|k) \to A^{l+1}_{k,i_{l+1}} P^{l+1}_{i_{l+1}}$,
$\Pr(\mathbf{x}|y,k) \to P^{l|l+1}_{i_l,i_{l+1}}$ and $\Pr(y|\mathbf{x}) \to P^{l+1|l}_{i_{l+1},i_l}$
are made, then $\Pr(y|\mathbf{x})$ reduces to

$$P^{l+1|l}_{i_{l+1},i_l} = \sum_{k=1}^{K} \frac{P^{l|l+1}_{i_l,i_{l+1}}\, A^{l+1}_{k,i_{l+1}}\, P^{l+1}_{i_{l+1}}}{\sum_{k'=1}^{K} \sum_{i'_{l+1}=1}^{M_{l+1}} P^{l|l+1}_{i_l,i'_{l+1}}\, A^{l+1}_{k',i'_{l+1}}\, P^{l+1}_{i'_{l+1}}} \tag{2.9}$$

which is not the same as the PMD recognition model in
equation 5.4. The difference between equation 2.9 and
equation 5.4 arises because the full Bayesian approach
in equation 2.9 ensures that the model index $k$ and the
input $\mathbf{x}$ are mutually dependent (via the factor $\Pr(k|\mathbf{x})$),
whereas the PMD approach in equation 5.4 ignores such
dependencies.

In the full Bayesian approach (see equation 2.9) the
normalisation term in the denominator has a double
summation $\sum_{k'=1}^{K} \sum_{i'_{l+1}=1}^{M_{l+1}} P^{l|l+1}_{i_l,i'_{l+1}} A^{l+1}_{k',i'_{l+1}} P^{l+1}_{i'_{l+1}}$, which involves
all pairs of indices $k$ and $i_{l+1}$ with $A^{l+1}_{k,i_{l+1}} > 0$,
and which thus corresponds to long-range lateral interactions
in layer $l+1$. On the other hand, in the
PMD approach (see equation 5.4) the normalisation
term in the denominator has only a single summation
$\sum_{i'_{l+1}=1}^{M_{l+1}} P^{l|l+1}_{i_l,i'_{l+1}} A^{l+1}_{k,i'_{l+1}} P^{l+1}_{i'_{l+1}}$, so the lateral interactions in
layer $l+1$ are determined by the structure of the matrix
$A^{l+1}$, which defines only short-range lateral connections
(i.e. for a given recognition model $k$, only a
limited number of index values $i_{l+1}$ satisfy $A^{l+1}_{k,i_{l+1}} > 0$).


III. COMPARISON WITH THE HELMHOLTZ 
MACHINE 


In this appendix the relationship between two types of density model is discussed. The
first type is a conventional density model that approximates the input probability
density (i.e. the objective function is $L(P^0,Q^0)$), and the second type is the one
introduced here that approximates the joint probability density of a Markov source
(i.e. the objective function is $L(P,Q)$). In order to relate $L(P^0,Q^0)$ to
$L(P,Q)$ it is necessary to introduce additional layers (i.e. layers
$1, 2, \cdots, L$) into $L(P^0,Q^0)$ in an appropriate fashion.
The Helmholtz machine (HM) [2] does this by replacing $L(P^0,Q^0)$ by a different
objective function (which has these additional layers present as hidden variables), and
which is an upper bound on the original objective function $L(P^0,Q^0)$. It turns out
that the Helmholtz machine objective function $D_{HM}$ and the Markov source objective
function $L(P,Q)$ are closely related. The essential difference between the two is that
$D_{HM}$ does not include the cost of specifying the state of layers $1, 2, \cdots, L$
given that the state of layer 0 is known, which thus allows it to develop distributed
codes (which are expensive to specify) more easily.

In the conventional density modelling approach to neural networks, there are two basic
classes of model. In the case of both unsupervised and supervised neural networks the
source is $P^0$, which is the network input (unsupervised case) or the network output
(supervised case). Additionally, in the case of supervised neural networks $P^0$ is
conditioned on an additional network input as $P^{0|\mathrm{input}}$. Thus in both
cases there is only an external source (i.e. source layers $1, 2, \cdots, L$ are not
present), which is modelled by $Q^0$ (unsupervised case) or $Q^{0|\mathrm{input}}$
(supervised case). $Q^0$ or $Q^{0|\mathrm{input}}$ can be modelled in any way that is
convenient. Frequently a multilayer generative model of the form

$$Q^0_{i_0} = \sum_{i_1, i_2, \cdots, i_L} Q^{0|1}_{i_0,i_1} \cdots Q^{l|l+1}_{i_l,i_{l+1}} \cdots Q^{L}_{i_L} \tag{3.1}$$

is used, where the $i_l$ (for $1 \le l \le L$) are hidden variables, which need to be
summed over in order to calculate the required marginal probability $Q^0_{i_0}$, and
the notation is deliberately chosen to be the same as is used in the Markov chain model

$$Q_{i_0,i_1,\cdots,i_L} = Q^{0|1}_{i_0,i_1} \cdots Q^{l|l+1}_{i_l,i_{l+1}} \cdots Q^{L}_{i_L} \tag{3.2}$$

Helmholtz machines and Markov sources are related to each other. Thus the
$L(P^0,Q^0)$ that is minimised in conventional density modelling can be manipulated in
order to derive $D_{HM}$

$$\begin{aligned}
L(P^0,Q^0) &\le L(P^0,Q^0) + \sum_{i_0=1}^{M_0} P^0_{i_0}\, G_{i_0}(P^{1|0},Q^{1|0}) \\
&= -\sum_{i_0=1}^{M_0} P^0_{i_0} \sum_{i_1=1}^{M_1} P^{1|0}_{i_1,i_0} \log\left(Q^{0|1}_{i_0,i_1} Q^1_{i_1}\right) + \sum_{i_0=1}^{M_0} P^0_{i_0} \sum_{i_1=1}^{M_1} P^{1|0}_{i_1,i_0} \log P^{1|0}_{i_1,i_0} \\
&= L\left((P^0,P^{1|0}),(Q^1,Q^{0|1})\right) - \sum_{i_0=1}^{M_0} P^0_{i_0}\, H_{i_0}(P^{1|0}) \\
&= L(P,Q) - \sum_{i_0=1}^{M_0} P^0_{i_0}\, H_{i_0}(P^{1|0}) \\
&= D_{HM}
\end{aligned} \tag{3.3}$$
The inequality $L(P^0,Q^0) \le D_{HM}$ follows from
$G_{i_0}(P^{1|0},Q^{1|0}) \ge 0$ (i.e. the model $Q^{1|0}$ is imperfect, so that
$Q^{1|0} \ne P^{1|0}$). The inequality $D_{HM} \le L(P,Q)$ follows from
$H_{i_0}(P^{1|0}) \ge 0$ (i.e. the source $P^{1|0}$ is stochastic). If the model is
perfect ($Q^{1|0} = P^{1|0}$) and the source is deterministic ($P^{1|0}$ is such that
the state of layer 1 is known once the state of layer 0 is given), then these two
inequalities reduce to $L(P^0,Q^0) = L(P,Q)$.
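The chain of inequalities between the three objective functions can be illustrated numerically for a 2-layer model. The following is a minimal sketch, assuming small discrete layers and randomly generated (hypothetical) source and model probabilities; it uses the identification of $D_{HM}$ as $L(P,Q)$ minus the average conditional entropy of the source, and it also evaluates the two-term split of $D_{HM}$ into a K-type term and a G-type (relative entropy) term.

```python
import numpy as np

rng = np.random.default_rng(1)
M0, M1 = 4, 3  # hypothetical sizes of layer 0 and layer 1

# Source: P0[i0] and P1|0 stored as p1_given_0[i0, i1]
P0 = rng.dirichlet(np.ones(M0))
p1_given_0 = rng.dirichlet(np.ones(M1), size=M0)      # rows sum to 1

# Model: Q1[i1] and Q0|1 stored as q0_given_1[i0, i1]
Q1 = rng.dirichlet(np.ones(M1))
q0_given_1 = rng.dirichlet(np.ones(M0), size=M1).T    # columns sum to 1

P_joint = P0[:, None] * p1_given_0                    # P[i0, i1]
Q_joint = q0_given_1 * Q1[None, :]                    # Q[i0, i1]
Q0 = Q_joint.sum(axis=1)                              # marginal model of layer 0

L_P0_Q0 = -(P0 * np.log(Q0)).sum()                    # conventional density model
L_P_Q = -(P_joint * np.log(Q_joint)).sum()            # Markov source objective
H_avg = -(P_joint * np.log(p1_given_0)).sum()         # sum_i0 P0 H_i0(P1|0)
D_HM = L_P_Q - H_avg                                  # Helmholtz machine objective

# Two-term split: sparse-encoder (K) term plus distributed-encoder (G) term
K_term = -(P_joint * np.log(q0_given_1)).sum()
G_term = (P_joint * np.log(p1_given_0 / Q1[None, :])).sum()

print(L_P0_Q0, D_HM, L_P_Q)
print(K_term + G_term)
```

For any such random draw the three printed values should appear in non-decreasing order, and the sum of the two terms should reproduce $D_{HM}$.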

The properties of the optimal codes that are used by a Helmholtz machine when
$D_{HM}$ is minimised may be investigated by writing the expression for $D_{HM}$ as a
sum of two terms

$$D_{HM} = \sum_{i_0=1}^{M_0} P^0_{i_0}\, K_{i_0}(P^{1|0},Q^{0|1}) + \sum_{i_0=1}^{M_0} P^0_{i_0}\, G_{i_0}(P^{1|0},Q^1) \tag{3.4}$$

The $\sum_{i_0=1}^{M_0} P^0_{i_0} K_{i_0}(P^{1|0},Q^{0|1})$ part and the
$\sum_{i_0=1}^{M_0} P^0_{i_0} G_{i_0}(P^{1|0},Q^1)$ part compete with each other when
$D_{HM}$ is minimised. Assuming that $P^0_{i_0} > 0$, the
$\sum_{i_0=1}^{M_0} P^0_{i_0} G_{i_0}(P^{1|0},Q^1)$ part likes to make $Q^1$
approximate $P^{1|0}$, which tends to make $P^{1|0}$ behave like a distributed encoder.
On the other hand, assuming that $P^0_{i_0} > 0$, the
$\sum_{i_0=1}^{M_0} P^0_{i_0} K_{i_0}(P^{1|0},Q^{0|1})$ part likes to make $Q^{0|1}$
approximate $P^{0|1}$, which tends to make $P^{1|0}$ behave like a sparse encoder. The
tension between these two terms is optimally balanced when $D_{HM}$ is minimised.


The properties of the optimal codes that are used in the Markov source approach when
$L(P,Q)$ (see equation 2.13) is minimised are different. The 2-layer expression for
$L(P,Q)$ is

$$L(P,Q) = \sum_{i_0=1}^{M_0} P^0_{i_0}\, K_{i_0}(P^{1|0},Q^{0|1}) + L(P^1,Q^1) \tag{3.5}$$

which contains the same sparse encoder term
$\sum_{i_0=1}^{M_0} P^0_{i_0} K_{i_0}(P^{1|0},Q^{0|1})$ as $D_{HM}$. However, the
distributed encoder term $\sum_{i_0=1}^{M_0} P^0_{i_0} G_{i_0}(P^{1|0},Q^1)$ is
missing, and is replaced by $L(P^1,Q^1)$ which does not
Self-Organised Factorial Encoding of a Toroidal Manifold *


Stephen Luttrell 
Room EX21, QinetiQ, Malvern Technology Centre 


It is shown analytically how a neural network can be used optimally to encode input data that 
is derived from a toroidal manifold. The case of a 2-layer network is considered, where the output 
is assumed to be a set of discrete neural firing events. The network objective function measures 
the average Euclidean error that occurs when the network attempts to reconstruct its input from 
its output. This optimisation problem is solved analytically for a toroidal input manifold, and two 
types of solution are obtained: a joint encoder in which the network acts as a soft vector quantiser, 
and a factorial encoder in which the network acts as a pair of soft vector quantisers (one for each of 
the circular subspaces of the torus). The factorial encoder is favoured for small network sizes when 
the number of observed firing events is large. Such self-organised factorial encoding may be used to 
restrict the size of network that is required to perform a given encoding task, and will decompose 
an input manifold into its constituent submanifolds. 


I. INTRODUCTION 


The purpose of this paper is to show analytically how 
a neural network can be used to optimally encode input 
data that is derived from a toroidal manifold. For sim- 
plicity, only the case of a 2-layer network is considered, 
and an objective function is defined [12] that measures 
the average ability of the network to reconstruct the state 
of its input layer from the state of its output layer. The 
optimum network parameter values must then minimise 
this objective function. In this paper the output state is 
chosen to be the vector of locations of a finite number of 
the neural firing events that arise when an input vector 
is presented to the network, and, in the limit of a sin- 
gle firing event, this reduces to a winner-take-all encoder 
network. 


If the input vector is obtained from an arbitrary in- 
put probability density function (PDF), then the network 
would have to be optimised numerically, and a simple in- 
terpretation of its optimal parameters would not then 
be guaranteed. On the other hand, if the input PDF 
is constrained to have a simple enough form, then an 
analytic optimisation guarantees that the results can be 
interpreted. Because the purpose of this paper is mainly 
to interpret the nature of the optimal solution(s) that 
arise from the interplay between the input PDF and the 
network objective function, an analytic rather than a nu- 
merical approach will be used. 


The detailed form of the optimum network parameters 
depends on the chosen input PDF, and, for simplicity, 
the input PDF will be chosen to define a curved manifold 
which is uniformly populated by all of the allowed input 
vectors. The shape of this manifold then determines the 
type of optimum solution that the network adopts. For
instance, a 1-dimensional linear manifold with a uniform 
distribution of input vectors leads to an optimum solution 
in which each neuron fires only if the input lies within a 
small range of values, so the network behaves as a soft 
scalar quantiser. This result generalises to higher dimen- 
sional linear manifolds, where the network behaves as 
a soft vector quantiser. A more interesting type of opti- 
mum solution can occur when the manifold is curved. For 
instance, a circular manifold (which is a 1-dimensional 
manifold embedded in a 2-dimensional space) leads to 
an optimum solution that is analogous to the soft scalar 
quantiser obtained with a 1-dimensional linear manifold, 
but a toroidal manifold (which is a 2-dimensional mani- 
fold embedded in a 4-dimensional space) does not neces- 
sarily lead to an optimum solution that is analogous to 
the soft vector quantiser obtained with a 2-dimensional 
linear manifold. 

For a 2-dimensional toroidal manifold, it is possible for 
the optimum solution to be constructed out of a pair of 
soft scalar quantisers, each of which encodes only one of 
the two circular manifolds that form the toroidal mani- 
fold. This is called a factorial encoder (because it breaks 
the input into its constituent factors, which it then en- 
codes), as opposed to a joint encoder (which directly en- 
codes the input, without first breaking it into its con- 
stituent factors). Because a factorial encoder splits up 
the overall encoding problem into a number of smaller 
encoding problems, which it then tackles in parallel, it 
requires fewer neurons than a joint encoder would have 
needed for the same encoding problem. 

For the type of network objective function that is dis- 
cussed in this paper, factorial encoding does not occur 
with linear manifolds. This is because the random na- 
ture of the neural firing events does not guarantee that 
at least one such event occurs in each of the soft scalar 
quantisers in a factorial encoder, and, for a linear man- 
ifold, this leads to a much larger average reconstruction 
error if a factorial encoder is used than if a joint encoder 
is used. This effect is summarised in figure 1 for a linear 
manifold, and in figure 2 for a toroidal manifold. Hence- 
forth, only the toroidal case will be discussed, because 


it is a curved manifold which thus has interesting facto- 
rial encoding properties, whereas a linear manifold would 
not. 


Figure 1: Diagram (a) shows the encoding cells for joint en- 
coding of a 2-dimensional linear manifold; a typical encoding 
cell is shaded. Diagram (b) shows the corresponding encoding 
cells for a factorial encoder; typical encoding cells for each of 
the two factors and their intersection are shaded. The distor- 
tion that would result from only one of the two factors is large, 
because the encoding cell is a long thin rectangular region. 


Figure 2: Diagram (a) shows the encoding cells for joint en- 
coding of a 2-dimensional toroidal manifold; a typical encod- 
ing cell is shaded. Diagram (b) shows the corresponding en- 
coding cells for a factorial encoder; typical encoding cells for 
each of the two factors and their intersection are shaded. The 
distortion that would result from only one of the two factors 
is not as large as in the case of the corresponding linear man- 
ifold, because the long thin rectangular encoding cells are now 
wrapped round into loops, thus reducing the average separa- 
tion (in the Euclidean sense) of points within each encoding 
cell. 


In figure 2(a) the torus is overlaid with a 20 x 20 
toroidal lattice, and a typical joint encoding cell is high- 
lighted (this would use a total of 400 = 20 x 20 neurons). 
Figure 2(a) makes clear why such encoding is described 
as “joint”, because the response of each neuron depends 
on the values of both dimensions of the input. The neu- 
ral network implementation of this type of joint encoder 
would have connections from each output neuron to all 
of the input neurons. 

In figure 2(b) the torus is overlaid with a 20 x 20 
toroidal lattice, and a typical pair of intersecting facto- 
rial encoding cells is highlighted (this would use a total of 
40 = 20+ 20 neurons). Figure 2(b) makes clear why such 
encoding is described as “factorial”, because the response 
of each neuron depends on only one of the dimensions 
of the input, or, in other words, on only one factor that 
parameterises the input space. The neural network im- 
plementation of this type of factorial encoder would have 
connections from each output neuron to only half of the 
input neurons. In figure 2(b) an accurate encoding is obtained
by a process that is akin to triangulation, in which
the intersection between the 2 orthogonal encoding cells 
defines a region of the 2-torus that is equivalent to the 
corresponding joint encoding cell in figure 2(a). 

For a toroidal input manifold it turns out that there 
is an upper limit to the number of neurons that can be 
used if a factorial encoder is to have a smaller average re- 
construction error than the corresponding joint encoder. 
This limit is smaller than the number of neurons that 
are used in figure 2(b), so that diagram should not be 
interpreted too literally. 


A. Vector Quantisers 


The existing literature on the simplest type of encoder 
(i.e. the vector quantiser (VQ)) includes the following 
examples: 


1. A standard VQ, in which the input space is parti- 
tioned into a number of non-overlapping encoding 
cells, which is also known as an LBG vector quantiser
(after the initials of the authors of [7]). In
operation, all of the input vectors that lie closest 
(in the Euclidean sense) to a given code vector are 
assigned the same code index (which thus defines an 
encoding cell), and the approximate reconstruction 
of these inputs is then the centroid of the encoding 
cell. This type of VQ can be viewed as a single- 
layer winner-take-all (WTA) neural network. 


2. A topographic VQ (TVQ), in which the code in- 
dices and encoding cells are arranged so that code 
indices that differ by a small amount are assigned 
to encoding cells that are close to each other (in the 
Euclidean sense). This topographic property auto- 
matically emerges if a VQ is optimised for encod- 
ing input vectors to be transmitted along a noisy 
communication channel [1, 3, 6, 8]. The Kohonen 
topographic mapping network [5] is an approxima- 
tion to this type of encoder, as was explained in 
[8]. The TVQ may be generalised to a soft TVQ 
(STVQ) in which each code index is chosen prob- 
abilistically in response to the corresponding input 
vector [4, 9]. 


3. The simultaneous use of more than one standard VQ,
with each VQ encoding only a subspace of the input
(see for example [13]); in effect, more than one
code index is used to encode the input vector. By
this means, a high-dimensional space can be split 
up into a number of lower dimensional pieces. This 
type of VQ is equivalent to multiple single-layer 
WTA neural network modules, each of which oper- 
ates on a subspace of the input. This is an example 
of a factorial encoder, in which the input is split 
into a number of separate parts, or factors. 


4. The simultaneous use of multiple VQs can be ex- 
tended to a tree-like network of VQs [11]. This 
type of VQ is equivalent to multiple single-layer
WTA neural network modules which are connected 
together in a tree-like network of modules. 


For simplicity, only the case of a 2-layer network (i.e. an 
input and an output layer) will be considered, but oth- 
erwise the network will be obliged to learn how to make 
use of all of its neurons. The simplest encoder which 
has all of the required behaviour, and which includes the 
above 2-layer examples as special cases, is one in which 
the neurons fire discretely in response to the input, and, 
after a finite number of firing events has occurred, the 
input is then reconstructed as accurately as possible (in 
the Euclidean sense). In the special case where only a 
single firing event is observed, this reduces to a standard 
LBG vector quantiser that was discussed in case 1 above. 
In the more general case, where a finite number of firing 
events is observed, this can lead to factorial encoder net- 
works of the type that was discussed in case 3 above. 


B. Curved Manifolds 


The purpose of this paper is to derive optimal ways 
of encoding data using neural networks in which multi- 
ple firing events are observed, and to show that factorial 
encoder networks can be optimal when the input data 
lies on a curved manifold. In order to get a feel for how 
curved manifolds arise in image data, consider the ex- 
amples shown in figure 3 and figure 4, which show the 
manifold generated by a single target (figure 3) and by 
a pair of targets (figure 4), when projected onto three 
neighbouring pixels (i.e. the locus of the 3-vector formed 
from these pixel values is plotted as the target(s) move 
around). 

Clearly, these image manifolds are curved, and the cur- 
vature gets greater the narrower the Gaussian profiles 
used to generate the target images become. 
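The way such a manifold is traced out can be sketched in a few lines. The following is an illustrative sketch only, assuming a 1-dimensional Gaussian target profile sampled at three neighbouring pixels placed at positions -1, 0 and 1, and treating the half-width as the Gaussian standard deviation; the function name `pixel_projection` and these conventions are hypothetical and are not taken from the paper.

```python
import numpy as np

def pixel_projection(positions, half_width=1.0, pixels=(-1.0, 0.0, 1.0)):
    """Amplitude seen at each pixel as a Gaussian target moves.

    positions  : 1-D sequence of target centre positions
    half_width : Gaussian width (assumed here to mean the standard deviation)
    pixels     : positions of the three neighbouring pixels (assumed)
    Returns an array of shape (len(positions), len(pixels)); each row is one
    point on the manifold.
    """
    positions = np.asarray(positions, dtype=float)[:, None]
    pixels = np.asarray(pixels, dtype=float)[None, :]
    return np.exp(-((pixels - positions) ** 2) / (2.0 * half_width ** 2))

# Locus of the 3-vector as the target sweeps across the pixels,
# for a wide and a narrow target profile.
sweep = np.linspace(-4.0, 4.0, 81)
wide = pixel_projection(sweep, half_width=2.0)
narrow = pixel_projection(sweep, half_width=0.5)
print(wide.shape, narrow.shape)
```

Plotting the rows of `wide` and `narrow` as points in 3 dimensions gives curves of the kind described above, and allows the effect of the profile width on the curvature to be inspected directly.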

It is not at all obvious how best to encode vectors that 
lie on such manifolds. For instance, one might try to 
tile the manifold with a large number of small encoding 
cells obtained from some variant of a VQ, or one might 
try to project the manifold onto a basis obtained from 
some variant of principal components analysis (PCA). In 
fact, these two examples are both special cases of the ap- 
proach that is advocated in this paper; a VQ corresponds 
to a single firing event, whereas PCA corresponds to an 
infinite number of firing events. 

The problem of optimally encoding data that is derived 
from a general curved manifold requires a numerical so- 
lution. However, in order to develop our understanding, 
it is best to start with an analytically tractable example 
based on a simple curved manifold, which is carefully se- 
lected to preserve the essential features of more general 
curved manifolds. With this in mind, the most important 
feature to preserve in the analytic example is curvature. 
A circle is the simplest 1-dimensional curved manifold, 
which may then be used to construct higher dimensional 
toroidal manifolds. For instance, a pair of circles may be used to construct the
2-dimensional toroidal manifold shown in figure 2. It turns out that, if a toroidal
manifold is used, then the network objective function can be analytically minimised to
yield results that exhibit interesting joint encoder and factorial encoder properties.


Figure 3: Manifold formed when the 1-dimensional image of a target (a Gaussian profile
with a half-width of one pixel) is moved around. Only the projection $A_{i,j}$ onto the
pixels at $(i,j) = (-1,0)$, $(0,0)$ and $(1,0)$ is shown.


C. Structure of this Paper 


In section II the basic theoretical framework is intro- 
duced, from which some expressions are derived for opti- 
mising a network which is trained on data from a toroidal 
input manifold. In section III the detailed results for en- 
coding a circular input manifold are given (which are 
trivially related to the corresponding results for the case 
of joint encoding of a 2-torus), and in section IV these 
results are extended to the case of factorial encoding of 
a 2-torus. The results for joint encoding and factorial 
encoding are compared in section V. Some useful asymp- 
totic approximations are discussed in section VI, and a 
useful approximation to the optimal network is discussed 
in section VII. 


The main steps in the derivations are reported in the 
appendices to this paper, and in several cases there is a 
considerable amount of algebra involved, which was done 
using algebraic manipulator software [17]. 


Figure 4: Manifold formed when the 2-dimensional image of a target (a Gaussian profile
with half-widths of one pixel in each direction) is moved around. Only the projection
$A_{i,j}$ onto the pixels at $(i,j) = (-1,1)$, $(0,0)$ and $(1,1)$ is shown.


II. BASIC THEORETICAL FRAMEWORK 


The encoder model that is assumed throughout this paper is a 2-layer network of
neurons. The state of the input layer is denoted as an input vector x (which is assumed
in this paper to be a continuous activity pattern), and the state of the output layer
is denoted as the output vector y (which is assumed in this paper to be a discrete
pattern of firing events). The information content of the output state y may be used to
draw inferences about the input state x. This can be formalised by using Bayes’ theorem
in the form

$$\Pr(x|y) = \frac{\Pr(y|x)\,\Pr(x)}{\int dx'\,\Pr(y|x')\,\Pr(x')} \tag{2.1}$$

where the PDF Pr (x|y) of the input x given that the output y is known (i.e. the
generative model) is completely determined by two quantities: the likelihood Pr (y|x)
that output y occurs when input x is present (i.e. the recognition model), and the
prior PDF Pr (x) that input x could occur irrespective of whether y is being observed.
However, for all but the most trivial situations, if the functional form of Pr (y|x) is
simple then the functional form of Pr (x|y) is complicated (or vice versa, with the
roles of Pr (y|x) and Pr (x|y) interchanged). In other words, if the recognition and
generative models are strictly related by Bayes’ theorem, then difficulties inevitably
arise in analytic and numerical calculations.

A possible way around this problem is to use a network objective function $D_0$ that
has a simple functional form for the Pr (y|x), but has an approximation to the ideal
Pr (x|y) implied by Bayes’ theorem (or vice versa). A convenient choice is

$$\begin{aligned}
D_0 &= -\int dx \sum_{y} \Pr(x,y) \log Q(x,y) \\
&= -\int dx\,\Pr(x) \sum_{y} \Pr(y|x) \log Q(x|y) - \sum_{y} \Pr(y) \log Q(y)
\end{aligned} \tag{2.2}$$

Pr (x, y) is a joint probability that satisfies Pr (x, y) = Pr (y|x) Pr (x) =
Pr (x|y) Pr (y) (i.e. Bayes’ theorem holds), Q (x, y) is an approximation to Pr (x, y)
that satisfies the corresponding relationships Q (x, y) = Q (y|x) Q (x) =
Q (x|y) Q (y), $\int dx\,\Pr(x)\,(\cdots)$ integrates over all the possible states of
the input layer, $\sum_{y} \Pr(y|x)\,(\cdots)$ sums over all the possible states of the
output layer given that the state of the input layer is known, and
$\sum_{y} \Pr(y)\,(\cdots)$ sums over all the possible states of the output layer.


The objective function $D_0$ measures the average number of bits required when the
approximate joint probability Q (x, y) is used as a reference to encode each pair
(x, y) drawn randomly from the true joint probability Pr (x, y) [15], so $D_0$ belongs
to the class of minimum description length (MDL) objective functions [14]. Strictly
speaking, the number of bits depends on the accuracy with which the continuous-valued x
is measured. However, this refinement is omitted from equation 2.2 because it does not
affect the results in this paper, provided that the size of the quantisation cells into
which x is binned is much smaller than the scale on which Pr (x|y) and Q (x|y)
fluctuate.


The objective function $D_0$ can be simplified if Q (x, y) is assumed to have the
following properties


$$\begin{aligned}
Q(y) &= \text{constant} \\
Q(x|y) &= \frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^{\dim x}} \exp\left(-\frac{\|x - x'(y)\|^2}{2\sigma^2}\right)
\end{aligned} \tag{2.3}$$

where the approximation Q (x|y) to the true generative model Pr (x|y) is a Gaussian
PDF, and the prior probabilities Q (y) are constrained to all be equal. If the value of
$\sigma$ is fixed, then $D_0$ may be replaced by the simpler, but equivalent, vector
quantiser objective function $D_{VQ}$, which is defined as

$$D_{VQ} = \int dx\,\Pr(x) \sum_{y} \Pr(y|x)\,\|x - x'(y)\|^2 \tag{2.4}$$

where $\sum_{y} \Pr(y) = 1$ has been used to eliminate the
$\sum_{y} \Pr(y) \log Q(y)$ term. This measures the average Euclidean distortion that
occurs when the input x is probabilistically encoded as y, and then subsequently
reconstructed as x’ (y). This is a soft version of the LBG vector quantiser objective
function [7], in which y acts as a code index, Pr (y|x) acts as a soft encoding
prescription for probabilistically transforming x into y, and x’ (y) acts as the
corresponding code vector. The optimal Pr (y|x) that minimises $D_{VQ}$ is
deterministic (i.e. each x is transformed to one, and only one, y), so $D_{VQ}$
actually leads to an LBG vector quantiser itself, rather than merely a probabilistic
version thereof [9].
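The statement that the optimal encoder is deterministic can be illustrated with a small Monte Carlo estimate of the vector quantiser objective function. The following is a minimal sketch, assuming inputs drawn uniformly from a square, a fixed random codebook, and a softmax-style soft encoder; the helper name `d_vq` and all of these choices are hypothetical and serve only to compare a soft encoder with the nearest-code-vector (winner-take-all) encoder for the same code vectors.

```python
import numpy as np

def d_vq(x, post, code):
    """Sample average of sum_y Pr(y|x) ||x - x'(y)||^2.

    x    : inputs, shape (N, dim)
    post : encoder probabilities Pr(y|x), shape (N, M), rows sum to 1
    code : code vectors x'(y), shape (M, dim)
    """
    sq_dist = ((x[:, None, :] - code[None, :, :]) ** 2).sum(axis=2)  # (N, M)
    return (post * sq_dist).sum(axis=1).mean()

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=(2000, 2))      # hypothetical input samples
code = rng.uniform(-1.0, 1.0, size=(8, 2))      # hypothetical code vectors

sq_dist = ((x[:, None, :] - code[None, :, :]) ** 2).sum(axis=2)

# Soft encoder: probabilities fall off smoothly with squared distance
soft = np.exp(-sq_dist / 0.1)
soft /= soft.sum(axis=1, keepdims=True)

# Hard encoder: all probability on the nearest code vector
hard = np.zeros_like(soft)
hard[np.arange(len(x)), sq_dist.argmin(axis=1)] = 1.0

d_soft = d_vq(x, soft, code)
d_hard = d_vq(x, hard, code)
print(d_soft, d_hard)
```

For fixed code vectors the hard encoder can never give a larger value than the soft one, because a minimum over y is never larger than a weighted average over y.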

Under the same assumptions (see equation 2.3) that yielded the expression for
$D_{VQ}$, the Helmholtz machine objective function [2] would reduce to

$$D_{HM} = D_{VQ} + \int dx\,\Pr(x) \sum_{y} \Pr(y|x) \log \Pr(y|x) \tag{2.5}$$

where the extra term is the so-called “bits-back” term, which is (minus) the entropy of
the output y given that the input x is known, then averaged over all inputs. Thus
$D_{HM}$ does not directly penalise Pr (y|x) that have a large entropy, or, in other
words, it allows the recognition model Pr (y|x) to be such that many output states y
are permitted once the input state x is known. This means that the recognition models
produced by a Helmholtz machine tend to be more stochastic than they would have been
had the “bits-back” term been omitted from $D_{HM}$. Conversely, the objective function
$D_{VQ}$ that is used in this paper directly penalises Pr (y|x) that have a large
entropy, so the recognition models produced tend to be more deterministic than the
stochastic ones that the Helmholtz machine would produce under equivalent
circumstances. Thus using $D_{VQ}$ tends to lead to sparse codes in which few neurons
can fire, whereas using $D_{HM}$ tends to lead to distributed codes in which many
neurons can fire.

The chosen objective function has both an information theoretic interpretation (given
by $D_0$ in equation 2.2), in which it seeks to minimise the number of bits required to
encode Pr (x, y), and also an encoder/decoder interpretation (given by $D_{VQ}$ in
equation 2.4), in which it seeks to minimise the Euclidean distortion that arises when
x is encoded as y and then subsequently reconstructed as x’ (y). Also, using $D_{VQ}$
as the network objective function ensures backward compatibility with preexisting
results (e.g. [5, 7]).

An upper bound on the network objective function is introduced in section II A, and the
stationarity conditions which must be satisfied for an optimal network behaviour are
derived in section II B. Joint encoding on a 2-torus is discussed in section II C, and
factorial encoding on a 2-torus is discussed in section II D.


A. Objective Function 


In order to make progress it is necessary to make some assumptions about the network
output state y. Thus the output layer will be assumed to consist of $M$ neurons that
fire discretely in response to the input activity pattern x. Furthermore, y will be
assumed to be an $n$-dimensional vector, that consists of the observations of the
locations $(y_1, y_2, \cdots, y_n)$ of the first $n$ firing events that occur in
response to input x (this is described in detail in [12]). Note that the individual
$y_i$ are scalars, but the generalisation to vector-valued $y_i$ is straightforward.

For compatibility with results published earlier (e.g. [9, 12]), the objective function
that will be used here is $D = 2 D_{VQ}$, which has an upper bound $D_1 + D_2$ given by
(see appendix A for a detailed derivation and discussion)


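$$\begin{aligned}
D_1 &= \frac{2}{n} \int dx\,\Pr(x) \sum_{y=1}^{M} \Pr(y|x)\,\|x - x'(y)\|^2 \\
D_2 &= \frac{2(n-1)}{n} \int dx\,\Pr(x)\,\left\| x - \sum_{y=1}^{M} \Pr(y|x)\,x'(y) \right\|^2
\end{aligned} \tag{2.6}$$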


where Pr (y|x) is the probability that neuron y fires first in response to input x, and
x’ (y) is a reference vector that is used by neuron y in its attempt to approximately
reconstruct the input. In the limit $n = 1$ only $D_1$ contributes, and a standard LBG
vector quantiser emerges when $D_1$ is minimised. As $n \to \infty$ only $D_2$
contributes, and a PCA encoder emerges when $D_2$ is minimised.

This upper bound $D_1 + D_2$ on the objective function $D$ will be used to derive all
of the results in this paper. Its functional form, in which Pr (y|x) appears only
quadratically (unlike in equation 2.5 for $D_{HM}$), allows analytic results to be
readily derived.
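The split into $D_1$ and $D_2$ can be checked by brute force for small n. The following is a minimal sketch under two stated assumptions that are not spelled out in this section: the n firing events are drawn independently from Pr (y|x), and the reconstruction is the average of the reference vectors of the n neurons that fired. Under those assumptions, twice the average squared reconstruction error equals $D_1 + D_2$ exactly. The helper names `d1_d2` and `d_brute`, and the random test data on a circle, are hypothetical.

```python
import itertools
import numpy as np

def d1_d2(x, post, ref, n):
    """Sample estimates of D_1 and D_2 for n firing events."""
    sq_dist = ((x[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)   # (N, M)
    d1 = (2.0 / n) * (post * sq_dist).sum(axis=1).mean()
    mean_ref = post @ ref                                            # (N, dim)
    d2 = (2.0 * (n - 1) / n) * ((x - mean_ref) ** 2).sum(axis=1).mean()
    return d1, d2

def d_brute(x, post, ref, n):
    """Twice the mean squared error, enumerating all M**n firing sequences."""
    total = np.zeros(len(x))
    for ys in itertools.product(range(ref.shape[0]), repeat=n):
        idx = list(ys)
        prob = np.prod(post[:, idx], axis=1)      # independent firing events
        recon = ref[idx].mean(axis=0)             # average of reference vectors
        total += prob * ((x - recon) ** 2).sum(axis=1)
    return 2.0 * total.mean()

rng = np.random.default_rng(3)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
x = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # inputs on a circle
ref = rng.normal(size=(4, 2))                          # hypothetical x'(y)
logits = rng.normal(size=(200, 4))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

n = 3
d1, d2 = d1_d2(x, post, ref, n)
print(d1 + d2, d_brute(x, post, ref, n))
```

With `n = 1` the function `d1_d2` returns a zero $D_2$, which matches the remark that only $D_1$ contributes in that limit.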


B. Stationarity Conditions 


The upper bound $D_1 + D_2$ (see equation 2.6) on the objective function
$D = 2 D_{VQ}$ (see equation 2.4) needs to be minimised with respect to two types of
parameter: posterior probabilities Pr (y|x) and reference vectors x’ (y). This could be
done numerically for an arbitrary input PDF Pr (x) by using a gradient descent type of
algorithm [12], but here $D_1 + D_2$ will be analytically minimised for some carefully
chosen special cases of Pr (x).


The stationarity condition $\frac{\partial (D_1 + D_2)}{\partial x'(y)} = 0$ gives (see
appendix B 1)

$$n \int dx\,\Pr(x|y)\,x = x'(y) + (n-1) \int dx\,\Pr(x|y) \sum_{y'=1}^{M} \Pr(y'|x)\,x'(y') \tag{2.7}$$

where Pr (y) > 0 has been assumed. The
$\frac{\partial (D_1 + D_2)}{\partial x'(y)} = 0$ stationarity condition also has the
solution Pr (y) = 0, but this solution may be discarded because Pr (y) > 0 is always
the case in practice. The right hand side of the stationarity condition in equation 2.7
has two contributions: a $D_1$-like contribution which is a single reference vector
x’ (y), plus a $D_2$-like contribution which is $n - 1$ times a sum of reference
vectors $\sum_{y'=1}^{M} \left( \int dx\,\Pr(y'|x)\,\Pr(x|y) \right) x'(y')$, where
the coefficient $\int dx\,\Pr(y'|x)\,\Pr(x|y)$ accounts for the effect (at neuron y) of
observing all pairs of firing events $(y, y')$ for $y' = 1, 2, \cdots, M$. The sum of
these two terms is $n$ times the total reference vector that is effectively associated
with neuron y, which is $n$ times $\int dx\,\Pr(x|y)\,x$ as given on the left hand side
of equation 2.7.


The stationarity condition
$\frac{\partial (D_1 + D_2)}{\partial \log \Pr(y|x)} = 0$ gives (see appendix B 2)

$$\sum_{y'=1}^{M} \left( \Pr(y'|x) - \delta_{y,y'} \right) x'(y') \cdot \left( \frac{1}{2}\,x'(y') - n\,x + (n-1) \sum_{y''=1}^{M} \Pr(y''|x)\,x'(y'') \right) = 0 \tag{2.8}$$

where the constraint $\sum_{y'=1}^{M} \Pr(y'|x) = 1$ has been imposed, and Pr (x) > 0
and Pr (y|x) > 0 have been assumed. The
$\frac{\partial (D_1 + D_2)}{\partial \log \Pr(y|x)} = 0$ stationarity condition also
has two other solutions: either Pr (x) = 0, or Pr (x) > 0 and Pr (y|x) = 0. Using the
normalisation constraint $\sum_{y=1}^{M} \Pr(y|x) = 1$, the last of these solutions
ensures that $\Pr(y'|x) \le 1$ for $y' \ne y$, and when all values of y are considered
the net effect is to constrain Pr (y|x) to the interval $0 \le \Pr(y|x) \le 1$, as
expected.

The solutions of the stationarity condition for Pr (y|x) in equation 2.8 are piecewise
linear functions of x. This piecewise linear property of Pr (y|x) (as discussed in
appendix B 2) is an enormous simplification, because it means that rather than
searching the infinite dimensional space of functions Pr (y|x) for the optimal ones
that minimise $D_1 + D_2$, one needs only search a finite dimensional space of
piecewise linear functions Pr (y|x) (subject to the constraints
$0 \le \Pr(y|x) \le 1$ and $\sum_{y=1}^{M} \Pr(y|x) = 1$).


C. Joint Encoding


Joint encoding, as shown in figure 2(a), is characterised by a Pr (y|x) in which the
neurons labelled by y form a discretised version of the manifold that x lives on. For
instance, when x lives on a 2-torus, so that $x = (x_1, x_2)$ where
$x_1 = (\cos\theta_1, \sin\theta_1)$ and $x_2 = (\cos\theta_2, \sin\theta_2)$, where
$0 \le \theta_1 < 2\pi$ and $0 \le \theta_2 < 2\pi$, the Pr (y|x) typically behave as
shown in figure 2(a), where the 2-torus is tiled with encoding cells. When $n > 1$
neighbouring encoding cells overlap, so figure 2(a) does not then give an accurate
representation of the encoding cells.


For joint encoding of a 2-torus, y must be replaced by the pair $(y_1, y_2)$, where
the $y_1$ index labels one direction around the toroidal lattice, and $y_2$ labels the
other direction (this notation must not be confused with the
$(y_1, y_2, \cdots, y_n)$ notation that was used in section II A). Thus
$\Pr(y|x) \to \Pr(y_1, y_2|x_1, x_2)$ with $1 \le y_1 \le \sqrt{M}$ and
$1 \le y_2 \le \sqrt{M}$. For simplicity, assume
$\Pr(x_1, x_2) = \Pr(x_1)\,\Pr(x_2)$, where $\Pr(x_1)$ and $\Pr(x_2)$ each define a
uniform PDF on the input manifold. The following results for $D_1$ and $D_2$ may then
be derived (see appendix C 1)

$$\begin{aligned}
D_1 &= \frac{4}{n} \int dx_1\,\Pr(x_1) \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1)\,\|x_1 - x_1'(y_1)\|^2 \\
D_2 &= \frac{4(n-1)}{n} \int dx_1\,\Pr(x_1)\,\left\| x_1 - \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1)\,x_1'(y_1) \right\|^2
\end{aligned} \tag{2.9}$$

These results for $D_1$ and $D_2$ show that, under the simplifying assumptions made
above, the problem of optimising a joint encoder is equivalent to the problem of
optimising an encoder for $x_1$ alone (with the replacement $M \to \sqrt{M}$), and
then multiplying the value of $D_1 + D_2$ by a factor 2 to account for $x_2$ as well.
This illustration of the behaviour of joint encoder posterior probabilities in the case
of $\Pr(y_1, y_2|x_1, x_2)$ may readily be generalised to higher dimensions.


D. Factorial Encoding


Factorial encoding, as shown in figure 2(b), is characterised by a Pr (y|x) in which
the neurons labelled by y are partitioned into a number of subsets, each of which forms
a discretised version of a subspace of the manifold that x lives on. For instance, when
x lives on a 2-torus, and the neurons are partitioned into two equal-sized subsets, the
Pr (y|x) typically behave as shown in figure 2(b), where each of the two circular
subspaces within the 2-torus is tiled with encoding cells, which overlap when
$n > 1$.


For factorial encoding of a 2-torus Pr(y|x) = Pr(y|x_1, x_2) = ½ Pr(y|x_1) + ½ Pr(y|x_2), where Σ_{y=1}^{M/2} Pr(y|x_1) = 1, Σ_{y=M/2+1}^{M} Pr(y|x_2) = 1, Pr(y|x_1) = 0 for M/2 + 1 ≤ y ≤ M, and Pr(y|x_2) = 0 for 1 ≤ y ≤ M/2. For simplicity, assume Pr(x_1, x_2) = Pr(x_1) Pr(x_2), where Pr(x_1) and Pr(x_2) each define a uniform PDF on the input manifold. The following results for D_1 and D_2 may then be derived (see appendix C 2)


$$
\begin{aligned}
D_1 &= \frac{2}{n} \left( \int dx_1 \Pr(x_1) \sum_{y=1}^{M/2} \Pr(y|x_1) \left\| x_1 - x_1'(y) \right\|^2 + \int dx_2 \Pr(x_2) \left\| x_2 \right\|^2 \right) \\
D_2 &= \frac{4(n-1)}{n} \int dx_1 \Pr(x_1) \left\| x_1 - \frac{1}{2} \sum_{y=1}^{M/2} \Pr(y|x_1)\, x_1'(y) \right\|^2
\end{aligned}
\tag{2.10}
$$

These results for D_1 and D_2 show that, under the simplifying assumptions made above, the problem of optimising a factorial encoder is closely related to the problem of optimising two 1-dimensional encoders. This illustration of the behaviour of factorial encoder posterior probabilities in the case of Pr(y|x_1, x_2) may readily be generalised to higher dimensions.


III. CIRCULAR MANIFOLD


The analysis of how to encode data that lives on a curved manifold begins with the case of data that lives on a circle. In particular, assume that the input vector x is uniformly distributed on the unit circle centred on the origin, so that x can be parameterised by a single angular variable θ, thus

$$
x = (\cos\theta, \sin\theta), \qquad \int dx \Pr(x)\,(\cdots) = \frac{1}{2\pi} \int_0^{2\pi} d\theta\,(\cdots)
\tag{3.1}
$$

The posterior probability Pr(y|x) may thus be replaced by Pr(y|θ), and for purely conventional reasons, the range of y is now chosen to be y = 0, 1, ···, M−1 rather than y = 1, 2, ···, M. The set of M posterior probabilities for y = 0, 1, ···, M−1 can be parameterised as


$$
\Pr(y|\theta) = p\left(\theta - \frac{2\pi y}{M}\right)
\tag{3.2}
$$

where p(θ) is the θ-dependence of the posterior probability associated with the y = 0 neuron. The θ-dependence of p(θ) must be piecewise sinusoidal (i.e. made out of pieces that each have the functional form a + b cos θ + c sin θ) in order to ensure that Pr(y|x) is piecewise linear, as is required of solutions to equation B4. Similarly, the M corresponding reference vectors can be parameterised as

$$
x'(y) = r \left( \cos\left(\frac{2\pi y}{M}\right), \sin\left(\frac{2\pi y}{M}\right) \right)
\tag{3.3}
$$

which all have length r, and thus form a regular M-sided polygon.

It turns out that, for input vectors that live on a circular manifold, optimal joint encoding never causes more than 3 different neurons to fire in response to a given input (i.e. no more than 3 posterior probabilities overlap in input space). This severely limits the number of different piecewise functions that have to be manipulated when solving the D_1 + D_2 minimisation problem for input vectors that live on a circle. An analogous simplification also holds for joint and factorial encoding of a 2-torus. The case of 2 overlapping posterior probabilities can be optimised without too much difficulty, but the case of 3 overlapping posterior probabilities involves a prohibitively large amount of algebra, for which it is convenient to use an algebraic manipulator [17]. The calculations turn out to be highly structured, so an algebraic manipulator could in principle be used to solve even more complicated analytic problems.

All of the results for encoding input data that lives on a circular manifold may be derived from the expression for D_1 + D_2 in equation 2.6 (and the corresponding stationarity conditions), with the replacement given in equation 3.1 to ensure that the input manifold corresponds to a uniform distribution of data around a unit circle, and the functional forms given in equation 3.2 and equation 3.3.

The corresponding results for joint encoding of data that lives on a 2-torus can be obtained directly from these results (see section II C). The expression for the minimum value of D_1 + D_2 for joint encoding a 2-torus using √M × √M neurons is obtained by making the replacement M → √M in the expression for the minimum value of D_1 + D_2 for encoding a circle using M neurons, and then multiplying this result by 2 in order to account for both the circles that form the 2-torus (see equation 2.9).


A. Two Overlapping Posterior Probabilities 


A detailed derivation of the results reported in this section is given in appendix D1. Because the neurons have an angular separation of 2π/M (see the form of the posterior probability given in equation 3.2), the functional form of p(θ) may be defined as

$$
p(\theta) =
\begin{cases}
1 & 0 \le |\theta| \le \frac{\pi}{M} - s \\
f(\theta) & \frac{\pi}{M} - s \le |\theta| \le \frac{\pi}{M} + s \\
0 & |\theta| \ge \frac{\pi}{M} + s
\end{cases}
\tag{3.4}
$$

where the s parameter is half the angular width of the overlap between the posterior probabilities of adjacent neurons on the unit circle, in which case 0 ≤ s ≤ π/M ensures that no more than two neurons can respond to a given input. Anticipating the optimum solution, a typical example of this type of posterior probability is shown in figure 5.


In order to guarantee that Pr(y|x) has a piecewise linear dependence on x, as is required of solutions of equation 2.8, f(θ) must have the sinusoidal dependence f(θ) = a + b cos θ + c sin|θ|, where the use of |θ| arises because p(θ) = p(−θ). Note that the Pr(x) = 0 solution to the stationarity condition on Pr(y|x) (see equation 2.8) implies that Pr(y|x) is undefined for any x that does not lie on the unit circle. However, for those x that do lie on the unit circle, the a, b and c parameters can be determined by demanding continuity of p(θ) at the ends of its piecewise intervals (i.e. at θ = π/M − s and θ = π/M + s), and by demanding that the total probability of any neuron firing first is unity (i.e. the total posterior probability is normalised such that f(θ) + f(2π/M − θ) = 1 in the interval π/M − s ≤ θ ≤ π/M + s), to obtain

$$
f(\theta) = \frac{1}{2} + \frac{1}{2} \frac{\sin\left(\frac{\pi}{M} - \theta\right)}{\sin s}
\tag{3.5}
$$

This corresponds to a piecewise linear contribution to Pr(y|x) whose gradient points in the (−sin(π/M), cos(π/M)) direction. A typical example of this type of posterior probability is shown in figure 5.
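
As a minimal numerical sketch of the parameterisation above, the following Python fragment tabulates the two-overlap posterior and checks that the M posteriors sum to unity with at most two of them non-zero. The function names `p_two_overlap` and `posteriors` are illustrative choices rather than notation from the text, and the value of s used is an arbitrary overlap, not an optimised one.

```python
import math


def p_two_overlap(theta, M, s):
    """Posterior p(theta) of the y = 0 neuron for 0 < s <= pi/M (two-overlap case)."""
    # Wrap the angle into [-pi, pi] and use the symmetry p(theta) = p(-theta).
    t = abs(math.remainder(theta, 2.0 * math.pi))
    half_spacing = math.pi / M
    if t <= half_spacing - s:
        return 1.0
    if t >= half_spacing + s:
        return 0.0
    # Sinusoidal piece f(theta) that interpolates between 1 and 0.
    return 0.5 + 0.5 * math.sin(half_spacing - t) / math.sin(s)


def posteriors(theta, M, s):
    """Pr(y|theta) = p(theta - 2*pi*y/M) for y = 0, ..., M-1."""
    return [p_two_overlap(theta - 2.0 * math.pi * y / M, M, s) for y in range(M)]


if __name__ == "__main__":
    M, s = 8, 0.19  # illustrative values only
    for k in range(5):
        theta = 0.1 * k
        probs = posteriors(theta, M, s)
        print(theta, sum(probs), sum(1 for q in probs if q > 0.0))
```

The normalisation holds for any admissible s because f(θ) + f(2π/M − θ) = 1 inside each overlap interval, which is the condition used to fix the constant term of f(θ).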


Figure 5: Plot of the optimal neural posterior probability p(θ) for M = 8 and n = 2. The neighbouring posterior probabilities p(θ ± 2π/M) are also plotted. The optimal value of s is s = 0.4955. The departure of p(θ) from linearity in the interval π/M − s ≤ θ ≤ π/M + s is too small to be easily seen.


Without loss of generality (because the solution is symmetric under rotations of θ which are multiples of 2π/M) set y = 0 in equation 2.8, to obtain in the interval π/M − s ≤ θ ≤ π/M + s

$$
0 = r \csc^2 s \, \sin\left(\frac{\pi}{M}\right) \sin\left(\frac{\pi}{M} - \theta\right) \left( \sin s - \sin\left(\frac{\pi}{M} - \theta\right) \right) \left( n \sin s - (n-1)\, r \sin\left(\frac{\pi}{M}\right) \right)
\tag{3.6}
$$

which may be solved for the optimum length r of the reference vectors, to yield

$$
r = \frac{n}{n-1} \frac{\sin s}{\sin\left(\frac{\pi}{M}\right)}
\tag{3.7}
$$

Set y = 0 in equation 2.7 to obtain a transcendental equation that must be satisfied by the optimum s

$$
\frac{\sin s}{\sin\left(\frac{\pi}{M}\right)} - \frac{n-1}{n} \frac{M}{\pi} \sin\left(\frac{\pi}{M}\right) \left( \cos s + s \sin s \right) = 0
\tag{3.8}
$$


The symmetry of the solution may be used to make the replacement (1/2π) ∫_0^{2π} dθ (···) → (M/2π) ∫_0^{2π/M} dθ (···) in the expressions for D_1 and D_2, which may then be evaluated and simplified to yield the minimum D_1 + D_2 as

$$
D_1 + D_2 = 2 - \frac{n}{n-1} \frac{M}{2\pi} \left( 2s + \sin(2s) \right)
\tag{3.9}
$$

The value of s which should be used in this expression for D_1 + D_2 is the solution of equation 3.8 for the chosen values of M and n.
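
The two-overlap solution can be evaluated numerically with a few lines of Python. This is a sketch under stated assumptions: the transcendental condition is taken in the form sin s / sin(π/M) − ((n−1)/n)(M/π) sin(π/M)(cos s + s sin s) = 0, the reference vector length as r = (n/(n−1)) sin s / sin(π/M), and the minimum as D_1 + D_2 = 2 − (n/(n−1))(M/2π)(2s + sin 2s). The function names and the use of plain bisection are illustrative choices, not part of the original analysis, and n > 1 is required.

```python
import math


def residual(s, M, n):
    """Left-hand side of the transcendental condition for s (two-overlap case)."""
    a = math.pi / M
    return (math.sin(s) / math.sin(a)
            - (n - 1.0) / n * (M / math.pi) * math.sin(a)
            * (math.cos(s) + s * math.sin(s)))


def solve_circle_two_overlap(M, n, iterations=200):
    """Return (s, r, D1 + D2) for a circle encoded by M neurons and n firing events."""
    lo, hi = 0.0, math.pi / M
    if residual(hi, M, n) < 0.0:
        raise ValueError("no root with s <= pi/M: the three-overlap case applies")
    for _ in range(iterations):  # bisection: the residual is negative at s = 0
        mid = 0.5 * (lo + hi)
        if residual(mid, M, n) < 0.0:
            lo = mid
        else:
            hi = mid
    s = 0.5 * (lo + hi)
    k = n / (n - 1.0)
    r = k * math.sin(s) / math.sin(math.pi / M)
    d = 2.0 - k * (M / (2.0 * math.pi)) * (2.0 * s + math.sin(2.0 * s))
    return s, r, d


if __name__ == "__main__":
    print(solve_circle_two_overlap(8, 2))
```

Bisection is sufficient here because the residual is negative at s = 0 and increases monotonically with s over the interval 0 ≤ s ≤ π/M; if it is still negative at s = π/M then more than two posterior probabilities must overlap and this branch of the solution does not apply.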

Note that the expression for r in equation 3.7 and the expression for D_1 + D_2 in equation 3.9 both have finite limits as n → 1, because the limiting behaviour of the solution s of equation 3.8 is s → (n − 1)(M/π) sin²(π/M) (see the asymptotic results in section VI), which contains a factor n − 1 to cancel the 1/(n − 1) factor that appears in both equation 3.7 and equation 3.9.


B. Three Overlapping Posterior Probabilities 


A detailed derivation of the results reported in this section is given in appendix D2. Because the neurons have an angular separation of 2π/M, the functional form of p(θ) may be defined as

$$
p(\theta) =
\begin{cases}
f_1(\theta) & 0 \le |\theta| \le -\frac{\pi}{M} + s \\
f_2(\theta) & -\frac{\pi}{M} + s \le |\theta| \le \frac{3\pi}{M} - s \\
f_3(\theta) & \frac{3\pi}{M} - s \le |\theta| \le \frac{\pi}{M} + s \\
0 & |\theta| \ge \frac{\pi}{M} + s
\end{cases}
\tag{3.10}
$$

where the s parameter is half the angular width of the overlap between the posterior probabilities of adjacent neurons on the unit circle, in which case π/M ≤ s ≤ 2π/M ensures that no more than 3 neurons can respond to a given input. Anticipating the optimum solution, a typical example of this type of posterior probability is shown in figure 6.

In order to guarantee that Pr(y|x) has a piecewise linear dependence on x, the f_i(θ) must have the sinusoidal dependence f_i(θ) = a_i + b_i cos θ + c_i sin|θ| for i = 1, 2, 3. For those x that lie on the unit circle, the a_i, b_i and c_i parameters can be determined by imposing continuity of p(θ) at θ = −π/M + s, θ = 3π/M − s and θ = π/M + s, and normalisation of the total posterior probability such that f_1(θ) + f_3(2π/M + θ) + f_3(2π/M − θ) = 1 in the interval 0 ≤ θ ≤ −π/M + s, and f_2(θ) + f_2(2π/M − θ) = 1 in the interval −π/M + s ≤ θ ≤ 3π/M − s. Also, to satisfy the stationarity conditions, set y = 0 in equation 2.7, and also set y = 0 in equation 2.8 in each of the intervals 0 ≤ θ ≤ −π/M + s, −π/M + s ≤ θ ≤ 3π/M − s and 3π/M − s ≤ θ ≤ π/M + s. These conditions are sufficient to solve for the optimum f_i(θ) for i = 1, 2, 3, the optimum r, and the optimum s.


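The optimum f_i(θ) are

$$
\begin{aligned}
f_1(\theta) &= \frac{1}{2} \csc^2\left(\frac{\pi}{M}\right) \left( \cos\left(\frac{\pi}{M}\right) \cos\theta \, \sec\left(\frac{2\pi}{M} - s\right) - \cos\left(\frac{2\pi}{M}\right) \right) \\
f_2(\theta) &= \frac{1}{2} \left( \cot\left(\frac{\pi}{M}\right) \sec\left(\frac{2\pi}{M} - s\right) \sin\left(\frac{\pi}{M} - \theta\right) + 1 \right) \\
f_3(\theta) &= -\frac{1}{4} \csc^2\left(\frac{\pi}{M}\right) \left( \cos\left(\frac{3\pi}{M} - \theta\right) \sec\left(\frac{2\pi}{M} - s\right) - 1 \right)
\end{aligned}
\tag{3.11}
$$

which correspond to different piecewise linear contributions to Pr(y|x). The f_1(θ) piece has a gradient that points in the (1, 0) direction, the f_2(θ) piece has a gradient that points in the (−sin(π/M), cos(π/M)) direction, and the f_3(θ) piece has a gradient that points in the (−sin(3π/M), cos(3π/M)) direction. The optimum r is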


$$
r = \frac{n}{n-1} \frac{\cos\left(\frac{2\pi}{M} - s\right)}{\cos\left(\frac{\pi}{M}\right)}
\tag{3.12}
$$

and the transcendental equation that must be satisfied by the optimum s (for s = π/M this reduces to equation 3.8) is

$$
\frac{1}{n} \frac{\cos\left(\frac{2\pi}{M} - s\right)}{\cos\left(\frac{\pi}{M}\right)} - \frac{n-1}{n} \frac{M}{\pi} \cos\left(\frac{\pi}{M}\right) \left( \sin\left(\frac{2\pi}{M} - s\right) - \left(\frac{2\pi}{M} - s\right) \cos\left(\frac{2\pi}{M} - s\right) \right) = 0
\tag{3.13}
$$

and the minimum D_1 + D_2 may be obtained as

$$
D_1 + D_2 = 2 - \frac{n}{n-1} \frac{M}{2\pi} \left( 2s + \sin\left(\frac{4\pi}{M} - 2s\right) \right)
\tag{3.14}
$$

As in section III A, the limit n → 1 is well behaved because the limiting behaviour of the solution s of equation 3.13 contains a factor n − 1 (see the asymptotic results in section VI) to cancel the 1/(n − 1) factor that appears in both equation 3.12 and equation 3.14.

Figure 6: Plot of the optimal neural posterior probability p(θ) for M = 8 and n = 100. The neighbouring posterior probabilities p(θ ± 2π/M) are also plotted. The optimal value of s is s ≈ 1.3955 π/M.

The results for the optimum value of s (i.e. equation 3.8 and equation 3.13) may be combined to yield the results shown in figure 7.

Asymptotically, as M → ∞ and n → ∞, the contour s = π/M (the dashed line in figure 7), which is the boundary between the regions where 2 and 3 posterior probabilities overlap, is given by n ≈ 3M²/π² (see the asymptotic results in section VI).

The corresponding results for joint encoding of input vectors that live on a 2-torus are shown in figure 8.


IV. TOROIDAL MANIFOLD: FACTORIAL ENCODING


All of the results for factorial encoding of input data that lives on a toroidal manifold may be derived from the expression for D_1 + D_2 in equation 2.10 (and the corresponding stationarity conditions), with the appropriate replacements for equations 3.1, 3.2 and 3.3.

Figure 7: Contour plot of the optimum value of s versus (n, M) for encoding of a circular manifold. The solid contours are for the interval 0 ≤ s ≤ π/M, the dotted contours are for π/M ≤ s ≤ 2π/M, and the dashed contour is for s = π/M (this behaves asymptotically as n ≈ 3M²/π²). The contours are all separated by intervals of π/(10M).


The posterior probability p(θ) then has the same functional form as for a circular manifold, except that M is replaced by M/2 because each of the two dimensions uses exactly half of the total of M neurons, so these results are not quoted explicitly here. The steps in the derivation of the optimum values of r and s and the minimum value of D_1 + D_2 are analogous to the steps that appear in the derivation for a circular input manifold, and the results are sufficiently different from the ones that were obtained from a circular manifold that they are quoted explicitly here.


Figure 8: Contour plot of the optimum value of s versus (n, M) for joint encoding of a toroidal manifold. The solid contours are for the interval 0 ≤ s ≤ π/√M, the dotted contours are for π/√M ≤ s ≤ 2π/√M, and the dashed contour is for s = π/√M (this behaves asymptotically as n ≈ 3M/π²). The contours are all separated by intervals of π/(10√M).


A. Two Overlapping Posterior Probabilities


A detailed derivation of the results reported in this section is given in appendix D3. The stationarity conditions yield the optimum r as

$$
r = \frac{2n}{n-1} \frac{\sin s}{\sin\left(\frac{2\pi}{M}\right)}
\tag{4.1}
$$

The transcendental equation that must be satisfied by the optimum s is

$$
\frac{\sin s}{\sin\left(\frac{2\pi}{M}\right)} - \frac{n-1}{n+1} \frac{M}{2\pi} \sin\left(\frac{2\pi}{M}\right) \left( \cos s + s \sin s \right) = 0
\tag{4.2}
$$

The expression for the minimum D_1 + D_2 is

$$
D_1 + D_2 = 4 - \frac{n}{n-1} \frac{M}{2\pi} \left( 2s + \sin(2s) \right)
\tag{4.3}
$$


B. Three Overlapping Posterior Probabilities


A detailed derivation of the results reported in this section is given in appendix D4. The stationarity conditions yield the optimum r as

$$
r = \frac{2n}{n-1} \frac{\cos\left(\frac{4\pi}{M} - s\right)}{\cos\left(\frac{2\pi}{M}\right)}
\tag{4.4}
$$

The transcendental equation that must be satisfied by the optimum s is

$$
\frac{1}{n} \frac{\cos\left(\frac{4\pi}{M} - s\right)}{\cos\left(\frac{2\pi}{M}\right)} - \frac{n-1}{n} \frac{M}{4\pi} \cos\left(\frac{2\pi}{M}\right) \left( \sin\left(\frac{4\pi}{M} - s\right) - \left(\frac{4\pi}{M} - s\right) \cos\left(\frac{4\pi}{M} - s\right) \right) = 0
\tag{4.5}
$$

The expression for the minimum D_1 + D_2 is

$$
D_1 + D_2 = 4 - \frac{n}{n-1} \frac{M}{2\pi} \left( 2s + \sin\left(\frac{8\pi}{M} - 2s\right) \right)
\tag{4.6}
$$

The results for the optimum value of s (i.e. equation 4.2 and equation 4.5) may be combined to yield the results shown in figure 9.

Figure 9: Contour plot of the optimum value of s versus (n, M) for factorial encoding of a toroidal manifold. The solid contours are for the interval 0 ≤ s ≤ 2π/M, the dotted contours are for 2π/M ≤ s ≤ 4π/M, and the dashed contour is for s = 2π/M (this behaves asymptotically as n ≈ 3M²/(2π²)). The contours are all separated by intervals of π/(5M).


V. JOINT VERSUS FACTORIAL ENCODING


The results in section III and section IV may be used to deduce when a factorial encoder is favoured with respect to a joint encoder (for input data that lives on a 2-torus).

Firstly, equation 3.8 (with the replacement M → √M, and setting s = π/√M) may be used to deduce the region of the (n, M) plane where joint encoding of a 2-torus involves no more than 2 overlapping posterior probabilities, and equation 4.2 (with s = 2π/M) may be used to deduce the corresponding result for factorial encoding of a 2-torus. Once these regions have been established, it is then possible to decide which of equation 3.9 or equation 3.14 (with M → √M and then multiplied overall by 2) to use to calculate D_1 + D_2 in the case of joint encoding a 2-torus, and which of equation 4.3 or equation 4.6 to use to calculate D_1 + D_2 in the case of factorial encoding a 2-torus. These results are gathered together in figure 10.

The need to derive results where up to 3 posterior prob- 
abilities overlap (which involves a large amount of alge- 
bra) is clear from the results shown in figure 10, where it 
may be seen that most of the region where the factorial 
encoder is favoured with respect to the joint encoder has 
up to 3 overlapping posterior probabilities. The degree 
to which a factorial encoder is favoured with respect to 
a joint encoder may be seen in figure 11. 

If the number of neurons M is restricted (i.e. M < 12), then the joint encoding scheme in which the 2-torus is encoded using small encoding cells as shown in figure 2(a), is usually not as good as the factorial encoding scheme in which the 2-torus is encoded using the intersection of pairs of elongated encoding cells as shown in figure 2(b). This does require that the number of firing events n is sufficiently large that both subsets of M/2 neurons in the factorial encoder are virtually guaranteed to each receive at least 1 firing event, so that they can indeed approximate the input vector by the intersection of a pair of response regions.

If the number of neurons M is too large (i.e. M > 12), then the joint encoding scheme is always favoured with respect to the factorial encoding scheme, because there are sufficient neurons to encode the 2-torus well using small response regions, as shown in figure 2(a). This includes the limiting case M → ∞, where the curvature of the input manifold is not visible to each neuron separately, because each neuron then responds to an infinitesimally small angular interval of the input manifold. This result implies that joint encoding is always favoured when the input manifold is planar, as was discussed in figure 1 and figure 2.

Figure 10: The diagram shows various results pertaining to joint and factorial encoding of a 2-torus. The solid line is the boundary between the regions of the (n, M) plane where joint or factorial encoding are favoured, and the horizontal dashed line is the asymptotic limit M ≈ 12 of this boundary as n → ∞. The left hand dashed line is the boundary between the regions where 2 or 3 overlapping posterior probabilities occur in joint encoding, and the right hand dashed line is the corresponding boundary for factorial encoding.

Figure 11: Plots for M = 6, 7, 8, 9, 10, 11 of (D_1 + D_2)_factorial − (D_1 + D_2)_joint in units in which (D_1 + D_2)_factorial = 1. This makes it clear that the degree to which a factorial encoder is favoured with respect to a joint encoder is quite significant for large n.

Although not presented here, these results generalise readily to higher dimensional toruses, where factorial encoding is even more favoured, because (roughly speaking) the number of neurons required to do joint encoding with a given resolution increases exponentially with the dimensionality of the input, whereas the number of neurons required to do factorial encoding with a given resolution increases linearly with the dimensionality of the input (provided that enough firing events are observed).
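
The rough counting argument can be illustrated with a short Python sketch. It assumes, purely for illustration, that a "given resolution" means a fixed number of encoding cells per circular dimension; the function names are hypothetical and the counts ignore all of the overlap and firing-event effects analysed above.

```python
def neurons_joint(cells_per_dimension, dimensions):
    """Joint encoding tiles the whole torus, so the count is a product over dimensions."""
    return cells_per_dimension ** dimensions


def neurons_factorial(cells_per_dimension, dimensions):
    """Factorial encoding tiles each circle separately, so the count is a sum."""
    return cells_per_dimension * dimensions


if __name__ == "__main__":
    for d in range(1, 6):
        print(d, neurons_joint(4, d), neurons_factorial(4, d))
```

For a 2-torus the two counts differ only by a modest factor, which is consistent with the factorial encoder being favoured there only in a limited region of the (n, M) plane; the gap widens quickly as the dimensionality grows.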


VI. ASYMPTOTIC RESULTS 


Referring to figure 10, the asymptotic behaviour as M → ∞ lies in the region where two posterior probabilities overlap, and the asymptotic behaviour as n → ∞ lies in the region where three posterior probabilities overlap, so care must be taken to use the appropriate results when deriving the various asymptotic approximations below. The boundary between the regions where two or three posterior probabilities overlap can be obtained for a circular input manifold by putting s = π/M in equation 3.8 (or s = 2π/M in equation 4.2 in the case of a toroidal input manifold), and as M → ∞ this is given by

$$
n \approx
\begin{cases}
\frac{3M^2}{\pi^2} & \text{circular manifold} \\
\frac{3M^2}{2\pi^2} & \text{toroidal manifold (factorial encoding)}
\end{cases}
\tag{6.1}
$$
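
The location of this boundary can be checked numerically. The sketch below assumes the two-overlap stationarity conditions in the forms sin s / sin(π/M) = ((n−1)/n)(M/π) sin(π/M)(cos s + s sin s) for a circle and sin s / sin(2π/M) = ((n−1)/(n+1))(M/2π) sin(2π/M)(cos s + s sin s) for factorial encoding of a 2-torus; setting s equal to half the neuron separation and solving for n gives the closed forms coded here. The function names are illustrative.

```python
import math


def boundary_n_circle(M):
    """n at which s = pi/M exactly solves the circular two-overlap condition (M > 2)."""
    a = math.pi / M
    return 1.0 + a / (math.cos(a) ** 2 * (math.tan(a) - a))


def boundary_n_torus_factorial(M):
    """n at which s = 2*pi/M solves the factorial two-overlap condition (M > 4)."""
    a = 2.0 * math.pi / M
    return 1.0 + 2.0 * a / (math.cos(a) ** 2 * (math.tan(a) - a))


if __name__ == "__main__":
    for M in (8, 32, 128, 1000):
        print(M,
              boundary_n_circle(M) / (3.0 * M ** 2 / math.pi ** 2),
              boundary_n_torus_factorial(M) / (3.0 * M ** 2 / (2.0 * math.pi ** 2)))
```

Both printed ratios approach 1 as M grows, because tan a − a behaves like a³/3 for small a, which is where the quadratic growth of the boundary with M comes from.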

As M → ∞ the asymptotic behaviour of D_1 + D_2 for a circular input manifold may be obtained by asymptotically expanding the s dependence of equation 3.8 (or equation 4.2 in the case of a toroidal input manifold) in inverse powers of M, to yield

$$
s \approx
\begin{cases}
\frac{n-1}{n} \frac{\pi}{M} + \frac{(n-1)(n^2 - 4n + 2)}{3n^3} \left(\frac{\pi}{M}\right)^3 & \text{circular manifold} \\
\frac{n-1}{n+1} \frac{2\pi}{M} + \frac{(n-1)(n^2 - 6n + 1)}{3(n+1)^3} \left(\frac{2\pi}{M}\right)^3 & \text{toroidal manifold (factorial encoding)}
\end{cases}
\tag{6.2}
$$

and substituting this solution into the appropriate expression for r to obtain

$$
r \approx
\begin{cases}
1 + \frac{(2n^2 - 6n + 3)\,\pi^2}{6n^2 M^2} & \text{circular manifold} \\
\frac{2n}{n+1} + \frac{8n(n^2 - 4n + 1)\,\pi^2}{3(n+1)^3 M^2} & \text{toroidal manifold (factorial encoding)}
\end{cases}
\tag{6.3}
$$

and substituting this solution into the appropriate expression for D_1 + D_2 to obtain

$$
D_1 + D_2 \approx
\begin{cases}
\frac{2(2n-1)\,\pi^2}{3n^2 M^2} & \text{circular manifold} \\
\frac{4}{n+1} + \frac{64 n^2 \pi^2}{3(n+1)^3 M^2} & \text{toroidal manifold (factorial encoding)}
\end{cases}
\tag{6.4}
$$


The asymptotic result for a circular manifold may be used to determine the corresponding result for a linear manifold. Thus, if lengths are scaled so that the separation of the neurons (as measured around the circular manifold) becomes unity, which requires that all lengths are divided by 2π/M, then asymptotically as M → ∞ the circular manifold solution becomes identical to the solution for a linear manifold with neurons separated by unit distance. Thus the optimum solution for a linear manifold with neurons separated by unit distance is s = (n − 1)/(2n) and D_1 + D_2 = (2n − 1)/(6n²) (note that D_1 + D_2 has the dimensions of (length)²).
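
The linear-manifold limit can be checked numerically by solving the circular problem for a large M and rescaling. The sketch below assumes the two-overlap condition in the form sin s = ((n−1)/n)(M/π) sin²(π/M)(cos s + s sin s) and the minimum D_1 + D_2 = 2 − (n/(n−1))(M/2π)(2s + sin 2s); the function names are illustrative, and the code assumes that (n, M) lies in the two-overlap region.

```python
import math


def optimum_s(M, n, iterations=200):
    """Bisection for the two-overlap stationarity condition on a circle."""
    a = math.pi / M
    c = (n - 1.0) / n * (M / math.pi) * math.sin(a) ** 2

    def g(s):  # zero at the optimum s
        return math.sin(s) - c * (math.cos(s) + s * math.sin(s))

    lo, hi = 0.0, a
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def rescaled(M, n):
    """Return (s, D1 + D2) in units where adjacent neurons are unit distance apart."""
    s = optimum_s(M, n)
    d = 2.0 - n / (n - 1.0) * (M / (2.0 * math.pi)) * (2.0 * s + math.sin(2.0 * s))
    scale = M / (2.0 * math.pi)  # divide lengths by the neuron spacing 2*pi/M
    return s * scale, d * scale ** 2


if __name__ == "__main__":
    n = 3
    print(rescaled(2000, n), ((n - 1) / (2 * n), (2 * n - 1) / (6 * n ** 2)))
```

Because D_1 + D_2 has the dimensions of (length)², it is multiplied by the square of the scale factor, whereas s is multiplied by the scale factor itself.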


As n → 1 (i.e. the LBG vector quantiser limit) the asymptotic behaviour of D_1 + D_2 for a circular input manifold may be obtained by expanding the s dependence of equation 3.8 about the point s = 0 (or equation 4.2 about the point s = 0 for a toroidal input manifold), to yield

$$
s \approx
\begin{cases}
(n-1) \frac{M}{\pi} \sin^2\left(\frac{\pi}{M}\right) & \text{circular manifold} \\
\frac{n-1}{2} \frac{M}{2\pi} \sin^2\left(\frac{2\pi}{M}\right) & \text{toroidal manifold (factorial encoding)}
\end{cases}
\tag{6.5}
$$

which gives s = 0 when n = 1, so there is no overlap between the posterior probabilities for different neurons, as would be expected in a vector quantiser where only one neuron is allowed to fire. Substitute this solution into the appropriate expression for r to obtain at n = 1

$$
r =
\begin{cases}
\frac{M}{\pi} \sin\left(\frac{\pi}{M}\right) & \text{circular manifold} \\
\frac{M}{2\pi} \sin\left(\frac{2\pi}{M}\right) & \text{toroidal manifold (factorial encoding)}
\end{cases}
\tag{6.6}
$$

which is the distance of the centroid of an arc of the unit circle (with angular length 2π/M for a circular manifold, or angular length 4π/M for a toroidal manifold) from the origin, as expected for a network in which only one neuron can fire. So the best reconstruction is the centroid of the inputs that could have caused the single firing event. These results may be substituted into the appropriate expression for D_1 + D_2 to obtain at n = 1


$$
D_1 + D_2 =
\begin{cases}
2 - 2 \left(\frac{M}{\pi}\right)^2 \sin^2\left(\frac{\pi}{M}\right) & \text{circular manifold} \\
4 - 2 \left(\frac{M}{2\pi}\right)^2 \sin^2\left(\frac{2\pi}{M}\right) & \text{toroidal manifold (factorial encoding)}
\end{cases}
\tag{6.7}
$$

These results for D_1 + D_2 have a simple geometrical interpretation. For a circular manifold D_1 + D_2 is (twice) the average squared distance from an arc with angular length 2π/M to its associated reference vector, which is exactly what would be expected. For a toroidal manifold D_1 + D_2 is the same result with M → M/2 plus an extra contribution of 2, because a factorial encoder with only 1 firing event acts as a conventional encoder using M/2 neurons for the circular dimension that is fortunate enough to be associated with the firing event (hence the first contribution to D_1 + D_2), and acts as no encoder at all for the other circular dimension which is associated with no firing events (hence the extra contribution of 2 to D_1 + D_2).

As n → ∞ the asymptotic behaviour of D_1 + D_2 for a circular input manifold may be obtained by expanding the s dependence of equation 3.13 about the point s = 2π/M (or equation 4.5 about the point s = 4π/M for a toroidal input manifold), to yield

$$
s \approx
\begin{cases}
\frac{2\pi}{M} - \left( \frac{3\pi}{M n \cos^2\left(\frac{\pi}{M}\right)} \right)^{1/3} & \text{circular manifold} \\
\frac{4\pi}{M} - \left( \frac{12\pi}{M n \cos^2\left(\frac{2\pi}{M}\right)} \right)^{1/3} & \text{toroidal manifold (factorial encoding)}
\end{cases}
\tag{6.8}
$$

where the limiting values of s as n → ∞ (i.e. s → 2π/M for a circular manifold, and s → 4π/M for a toroidal manifold) stop just short of allowing four or more posterior probabilities to overlap. In this limit D_1 = 0, so for a circular manifold the network acts as a PCA encoder (see the discussion after equation 2.6) whose expansion coefficients sum to unity. In order to encode vectors on a unit circle without error three basis vectors are required; the expansion coefficients are probabilities which must sum to unity, so three basis vectors are required in order that there are two independent expansion coefficients. This is the reason why it is sufficient to consider no more than three overlapping posterior probabilities for encoding data that lives in a 2-dimensional manifold (this argument generalises straightforwardly to higher dimensions). The same argument applies to the case of factorial encoding of a toroidal manifold. Substitute this solution into the appropriate expression for r to obtain

$$
r \approx
\begin{cases}
\sec\left(\frac{\pi}{M}\right) \left( 1 - \frac{1}{2} \left( \frac{3\pi}{M n \cos^2\left(\frac{\pi}{M}\right)} \right)^{2/3} \right) & \text{circular manifold} \\
2 \sec\left(\frac{2\pi}{M}\right) \left( 1 - \frac{1}{2} \left( \frac{12\pi}{M n \cos^2\left(\frac{2\pi}{M}\right)} \right)^{2/3} \right) & \text{toroidal manifold (factorial encoding)}
\end{cases}
\tag{6.9}
$$

and into the appropriate expression for D_1 + D_2 to obtain

$$
D_1 + D_2 \approx
\begin{cases}
\frac{2}{n} \tan^2\left(\frac{\pi}{M}\right) & \text{circular manifold} \\
\frac{4}{n} \left( 2 \sec^2\left(\frac{2\pi}{M}\right) - 1 \right) & \text{toroidal manifold (factorial encoding)}
\end{cases}
\tag{6.10}
$$

Thus as n → ∞ it is possible to derive a value of M for which the asymptotic D_1 + D_2 is the same for joint and factorial encoding of a toroidal manifold. This value of M must satisfy (4/n) tan²(π/√M) = (4/n) (2 sec²(2π/M) − 1), which yields M ≈ 11.74.
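
The crossover value of M can be reproduced with a short root-finding sketch in Python. It assumes the condition in the form tan²(π/√M) = 2 sec²(2π/M) − 1 (the common factor 4/n cancels) and treats M as a continuous variable; the function names and the bracketing interval are illustrative choices.

```python
import math


def asymptotic_gap(M):
    """Joint minus factorial asymptotic D1 + D2 for a 2-torus, in units of 4/n."""
    joint = math.tan(math.pi / math.sqrt(M)) ** 2
    factorial = 2.0 / math.cos(2.0 * math.pi / M) ** 2 - 1.0
    return joint - factorial


def crossover_M(lo=9.0, hi=16.0, iterations=100):
    """Bisection for the M at which both encoders have the same asymptotic error."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if asymptotic_gap(mid) > 0.0:  # joint encoder still worse: move to larger M
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(crossover_M())
```

The gap is positive at the lower end of the bracket and negative at the upper end, so factorial encoding has the smaller asymptotic error below the crossover and joint encoding above it.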


VII. APPROXIMATE THE POSTERIOR PROBABILITY


A posterior probability may always be written in the form

$$
\Pr(y|x) = \frac{Q(x|y)}{\sum_{y'=0}^{M-1} Q(x|y')}
\tag{7.1}
$$

where Q(x|y) ≥ 0 (with Q(x|y) > 0 for at least one value of y for each x). If the neurons behaved in such a way that they produced independent Poissonian firing events in response to a given input, then Q(x|y) would be the firing rate (or activation function) of neuron y in response to input x.

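The optimum solution p(θ) (as given in equation 3.4 and equation 3.5) may be approximated on the unit circle (i.e. x = (cos θ, sin θ)) by defining Q(x|y) as

$$
Q(x|y) =
\begin{cases}
w(y) \cdot x - a & w(y) \cdot x \ge a \\
0 & w(y) \cdot x \le a
\end{cases}
\qquad
w(y) = \left( \cos\left(\frac{2\pi y}{M}\right), \sin\left(\frac{2\pi y}{M}\right) \right),
\qquad
a = \cos\left(\frac{\pi}{M}\right) - \sin\left(\frac{\pi}{M}\right) \sin s
\tag{7.2}
$$

where a is a threshold parameter, and w is a unit weight vector. This is the form of the neural activation function that is used in [16]. This leads to a good approximation to the optimum solution p(θ) because

$$
p(\theta) =
\begin{cases}
0 & \theta \le -\frac{\pi}{M} - s \\
\frac{Q(x|y=0)}{Q(x|y=0) + Q(x|y=M-1)} + O\left( \left(\theta + \frac{\pi}{M}\right)^3 \right) & -\frac{\pi}{M} - s \le \theta \le -\frac{\pi}{M} + s \\
1 & -\frac{\pi}{M} + s \le \theta \le \frac{\pi}{M} - s \\
\frac{Q(x|y=0)}{Q(x|y=0) + Q(x|y=1)} + O\left( \left(\theta - \frac{\pi}{M}\right)^3 \right) & \frac{\pi}{M} - s \le \theta \le \frac{\pi}{M} + s \\
0 & \theta \ge \frac{\pi}{M} + s
\end{cases}
\tag{7.3}
$$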

This approximation works well because curved input manifolds can be optimally encoded by using appropriate hy- 
perplanes (as defined in equation 7.2) to slice off pieces of the manifold. 
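
The quality of this hyperplane approximation can be examined numerically. The sketch below compares the exact two-overlap posterior with the normalised thresholded activations, under the assumption that the threshold is a = cos(π/M) − sin(π/M) sin s and that the weight vector of neuron y is the unit vector at angle 2πy/M; the function names are illustrative, and the value of s passed in should be the optimum for the chosen M and n.

```python
import math


def exact_p(theta, M, s):
    """Exact two-overlap posterior of the y = 0 neuron."""
    t = abs(math.remainder(theta, 2.0 * math.pi))
    a = math.pi / M
    if t <= a - s:
        return 1.0
    if t >= a + s:
        return 0.0
    return 0.5 + 0.5 * math.sin(a - t) / math.sin(s)


def rate(theta, y, M, s):
    """Thresholded linear activation Q(x|y) with a unit weight vector w(y)."""
    threshold = math.cos(math.pi / M) - math.sin(math.pi / M) * math.sin(s)
    projection = math.cos(theta - 2.0 * math.pi * y / M)  # w(y) . x on the unit circle
    return max(0.0, projection - threshold)


def approx_p(theta, M, s):
    """Approximate posterior of the y = 0 neuron from normalised activations."""
    rates = [rate(theta, y, M, s) for y in range(M)]
    return rates[0] / sum(rates)


if __name__ == "__main__":
    M, s = 8, 0.191  # s close to the optimum for M = 8, n = 2 (an assumed input)
    worst = max(abs(approx_p(0.001 * k, M, s) - exact_p(0.001 * k, M, s))
                for k in range(800))
    print(worst)
```

With this threshold the two posteriors agree in value and slope at θ = π/M, and the largest discrepancy occurs near the edges of the overlap interval, where the thresholded activation switches off slightly before the exact posterior reaches zero.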
This approximation breaks down as M → ∞, as can be seen by inspecting the series expansion of p(θ) near θ = π/M

$$
p(\theta) =
\begin{cases}
\frac{1}{2} - \frac{1}{2 \sin s} \left(\theta - \frac{\pi}{M}\right) + \frac{1}{12 \sin s} \left(\theta - \frac{\pi}{M}\right)^3 + O\left( \left(\theta - \frac{\pi}{M}\right)^5 \right) & \text{exact} \\
\frac{1}{2} - \frac{1}{2 \sin s} \left(\theta - \frac{\pi}{M}\right) + \frac{1}{12} \left( \frac{1}{\sin s} - \frac{3}{\tan\left(\frac{\pi}{M}\right) \sin^2 s} \right) \left(\theta - \frac{\pi}{M}\right)^3 + O\left( \left(\theta - \frac{\pi}{M}\right)^5 \right) & \text{approximate}
\end{cases}
$$

which differ in the O((θ − π/M)³) term. In the limit M → ∞ the half-width parameter s behaves like M⁻¹, so the O((θ − π/M)³) term behaves like M (θ − π/M)³ in the exact case, and M³ (θ − π/M)³ in the approximate case because of the contribution from the 1/(tan(π/M) sin² s) term.

As M → ∞ each neuron responds to a progressively smaller angular range of inputs on the unit circle, so from the point of view of each neuron the curvature of the input manifold becomes negligible (i.e. the input manifold appears to more and more closely approximate a straight line), which ultimately makes it impossible to use hyperplanes to slice off pieces of the manifold. In the M → ∞ limit, a better approximation to the posterior probability would be to use ball-shaped regions (e.g. a radial basis function network) to cut up the input manifold into pieces.

VIII. CONCLUSIONS 


The results in this paper demonstrate that, for input 
data that lies on a curved manifold (specifically, a 2- 
torus), and for an objective function that measures the 
average reconstruction error (in the Euclidean sense) of 
a 2-layer neural network encoder, the type of encoder 
that is optimal depends on the total number of neurons 
and on the total number of observed firing events in the 
network output layer. There are two basic types of en- 
coder: a joint encoder in which the network acts as a 
vector quantiser for the whole input space, and a facto- 
rial encoder in which the network breaks into a number 
of subnetworks, each of which acts as a vector quantiser 
for a subspace of the input space. 


The particular conditions under which factorial encoding is favoured with respect to joint encoding arise when the input data is derived from a curved input manifold, provided that the number of neurons is not too large, and provided that the number of observed neural firing events is large enough. Factorial encoding does not emerge when the input manifold is insufficiently curved, or equivalently when there are too many neurons, because then each neuron does not have a sufficiently large encoding cell to be aware of the manifold's curvature.

Factorial encoding allows the input data to be encoded 
using a much smaller number of neurons than would be 
the case if joint encoding were used. Because only a small 
number of neurons is used, a factorial encoding scheme 
must be succinct, so it has to abstract the underlying 
degrees of freedom in the input manifold; this is a very 
useful side-effect of factorial encoding. This effect be- 
comes stronger as the dimensionality of the curved input 
manifold is increased. 

The main simplification that makes these calculations 
possible is that, in an optimal neural network, the form 
for the posterior probability is a piecewise linear function 
of the input vector. This leads to an enormous simplifica- 
tion in the mathematics, because only the space of piece- 
wise linear functions needs to be searched for the optimal 
solution, rather than the whole space of functions (sub- 
ject to normalisation and non-negativity constraints). 

A convenient approximation to this type of factorial
encoder is the partitioned mixture distribution (PMD)
network [10], in which the individual subnetworks in the
factorial encoder network are constrained to share
parameters, which thus leads to an upper bound on the
minimum value of the objective function that would have
ideally been obtained with the unconstrained factorial
encoder network.


Appendix A: Objective Function


The objective function $D = 2 D_{VQ}$ is given by

$$D = 2 \int dx \Pr(x) \sum_{y} \Pr(y|x) \|x - x'(y)\|^2 \tag{A1}$$

If the observed state of the output layer is the locations
of $n$ firing events on $M$ neurons, then this expression for
$D$ can be manipulated into the following form [12]

$$D = 2 \int dx \Pr(x) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y_1, y_2, \cdots, y_n|x) \|x - x'(y_1, y_2, \cdots, y_n)\|^2 \tag{A2}$$

where $\Pr(y|x)$ has now been replaced by the
more explicit notation $\Pr(y_1, y_2, \cdots, y_n|x)$, and
$x'(y_1, y_2, \cdots, y_n)$ is a vector given by

$$x'(y_1, y_2, \cdots, y_n) = \int dx \Pr(x|y_1, y_2, \cdots, y_n) \, x \tag{A3}$$

where $\Pr(x|y_1, y_2, \cdots, y_n)$ may be expressed in terms
of $\Pr(x)$ and $\Pr(y_1, y_2, \cdots, y_n|x)$ by using Bayes' theorem
in equation 2.1. The goal now is to minimise
the expression for $D$ in equation A2 with respect to
the function $\Pr(y_1, y_2, \cdots, y_n|x)$. The correct value for
$x'(y_1, y_2, \cdots, y_n)$ may be determined by treating it as an
unknown parameter that has to be adjusted to minimise
$D$.

$\Pr(y_1, y_2, \cdots, y_n|x)$ may be interpreted as a recognition
model which transforms the state of the input layer
into (a probabilistic description of) the state of the output
layer, and $x'(y_1, y_2, \cdots, y_n)$ may be regarded as the
corresponding generative model that transforms the state
of the output layer into (an approximate reconstruction
of) the state of the input layer.

There is so much flexibility in the choice
of $\Pr(y_1, y_2, \cdots, y_n|x)$ (and the corresponding
$x'(y_1, y_2, \cdots, y_n)$) that even if $D$ is minimised, it
does not necessarily yield an encoded version of the
input that is easily interpretable. One way in which a
code can be encouraged to have a simple interpretation
is to force $x'(y_1, y_2, \cdots, y_n)$ (i.e. the generative model)
to be parameterised thus [12]

$$x'(y_1, y_2, \cdots, y_n) = x'(y_1) + x'(y_2) + \cdots + x'(y_n) \tag{A4}$$

which is a (symmetric) superposition of reference vectors 
x’ (y) from each neuron y that has been observed to 
fire. In this case each neuron has a clearly identifi- 
able contribution to the reconstruction of the input, 
which makes it much easier to interpret what each 
neuron is doing. In this case the $\|\cdots\|^2$ term in $D$
is symmetric under interchange of the $(y_1, y_2, \cdots, y_n)$,
so only the symmetric part $S[\Pr(y_1, y_2, \cdots, y_n|x)]$
of $\Pr(y_1, y_2, \cdots, y_n|x)$ under interchange of the
$(y_1, y_2, \cdots, y_n)$ contributes to $D$, because the symmetric
summation $\sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} (\cdots)$ then removes
all non-symmetric contributions.

Define the marginal probabilities $\Pr(y_1|x)$
and $\Pr(y_1, y_2|x)$ of the symmetric part
$S[\Pr(y_1, y_2, \cdots, y_n|x)]$ of $\Pr(y_1, y_2, \cdots, y_n|x)$ under
interchange of the $(y_1, y_2, \cdots, y_n)$ as

$$\begin{aligned}
\Pr(y_1|x) &= \sum_{y_2, y_3, y_4, \cdots, y_n = 1}^{M} S[\Pr(y_1, y_2, \cdots, y_n|x)] \\
\Pr(y_1, y_2|x) &= \sum_{y_3, y_4, \cdots, y_n = 1}^{M} S[\Pr(y_1, y_2, \cdots, y_n|x)]
\end{aligned} \tag{A5}$$


These marginal probabilities are for the case where n 
firing events have potentially been observed, but only the 
locations of 1 (or 2) firing event(s) chosen randomly from 
the total number n have actually been observed, with the 
locations of the other n—1 (or n—2) firing events having 
been averaged over. 
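
The symmetrisation and marginalisation described above can be illustrated numerically. The following is a minimal sketch, assuming that the joint posterior for one fixed input x is stored as an n-dimensional NumPy table with M entries per axis; the function names `symmetrise` and `marginals` are hypothetical and are not taken from the paper.

```python
import itertools
import numpy as np

def symmetrise(joint):
    """Average a joint table Pr(y1,...,yn|x) over all permutations of its axes."""
    perms = list(itertools.permutations(range(joint.ndim)))
    return sum(np.transpose(joint, p) for p in perms) / len(perms)

def marginals(joint):
    """Return (Pr(y1|x), Pr(y1,y2|x)) computed from the symmetric part."""
    sym = symmetrise(joint)
    n = sym.ndim
    single = sym.sum(axis=tuple(range(1, n)))  # sum over y2..yn
    pair = sym.sum(axis=tuple(range(2, n)))    # sum over y3..yn
    return single, pair

# Illustrative example: n = 3 firing events on M = 4 neurons, arbitrary joint table.
rng = np.random.default_rng(0)
M, n = 4, 3
joint = rng.random((M,) * n)
joint /= joint.sum()
single, pair = marginals(joint)
```

Because the table is symmetrised first, the pair marginal is a symmetric matrix whose row sums reproduce the single-event marginal.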
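If it is assumed that $\Pr(y_1|x)$ and $\Pr(y_1, y_2|x)$ are related by

$$\Pr(y_1, y_2|x) = \Pr(y_1|x) \Pr(y_2|x) \tag{A6}$$

then the objective function $D$ has an upper bound $D_1 + D_2$ given by [12]

$$\begin{aligned}
D_1 &= \frac{2}{n} \int dx \Pr(x) \sum_{y=1}^{M} \Pr(y|x) \|x - x'(y)\|^2 \\
D_2 &= \frac{2(n-1)}{n} \int dx \Pr(x) \left\| x - \sum_{y=1}^{M} \Pr(y|x) \, x'(y) \right\|^2
\end{aligned} \tag{A7}$$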


Each of the two marginal probabilities in equation A5
contributes to a different term in $D_1 + D_2$: $\Pr(y_1|x)$ contributes
to $D_1$, whereas $\Pr(y_1, y_2|x)$ contributes to $D_2$.
Informally speaking, $D_1$ measures the information that
a single firing event (out of $n$ such events) contributes
to the reconstruction of the input, whereas $D_2$ measures
the information that pairs of firing events (out of $n$ such
events) contribute to the reconstruction of the input. $D_1$
is weighted by a factor $\frac{1}{n}$ which suppresses the single firing
event contribution as $n \to \infty$, whereas $D_2$ is weighted
by a factor $\frac{n-1}{n}$ which suppresses the double firing event
contribution as $n \to 1$, as expected. If only the $D_1$ part
of the objective function is used (i.e. $n = 1$), then a
standard LBG vector quantiser [7] emerges which approximates
the input by a single reference vector $x'(y)$,
whereas if only the $D_2$ part of the objective function is
used (i.e. $n \to \infty$), then the network behaves essentially
as a principal component analyser (PCA) which
approximates the input by a sum of reference vectors
$\sum_{y=1}^{M} \Pr(y|x) \, x'(y)$, where the $\Pr(y|x)$ are expansion
coefficients which sum to unity, and the $x'(y)$ are basis
vectors.

The upper bound $D_1 + D_2$ on $D$ contains LBG encoding
and PCA encoding as two limiting cases, and
gives a principled way of interpolating between these extremes.
This useful property has been bought at the
cost of replacing $D$ by an upper bound $D_1 + D_2$,
which will yield only a suboptimal (from the point of
view of $D$) encoder. However, this upper bound can be
expected to be tight in cases where the input manifold
can be modelled accurately using the parametric form
$x'(y_1) + x'(y_2) + \cdots + x'(y_n)$. These conditions are well
approximated in images which consist of a discrete number
of constituents, each of which may be represented by
an $x'(y)$ for some choice of $y$. This model fails in situations
where two or more constituents are placed so that
they overlap, in which case the image will typically contain
occluded objects, whereas the model assumes that
the objects linearly superpose. Occlusion is not an easy
situation to model, so it will be assumed that the image
constituents are sufficiently sparse that they rarely
occlude each other.


Appendix B: Stationarity Conditions


The expression for $D_1 + D_2$ (see equation 2.6) has
two types of parameters that need to be optimised: the
reference vectors $x'(y)$ and the posterior probabilities
$\Pr(y|x)$. In appendix B1 the stationarity condition for
$x'(y)$ is derived, and in appendix B2 the stationarity
condition for $\Pr(y|x)$ is derived, taking into account the
constraints $0 \le \Pr(y|x) \le 1$ and $\sum_{y=1}^{M} \Pr(y|x) = 1$
which must be satisfied by probabilities.


1. Stationary x' (y)


The stationarity condition $\frac{\partial (D_1 + D_2)}{\partial x'(y)} = 0$ for $x'(y)$ was
derived in [13]. Thus $\frac{\partial (D_1 + D_2)}{\partial x'(y)}$ can be written as

$$\frac{\partial (D_1 + D_2)}{\partial x'(y)} = -\frac{4}{n} \int dx \Pr(x) \Pr(y|x) \left\{ x - x'(y) + (n-1) \sum_{y'=1}^{M} \Pr(y'|x) \left( x - x'(y') \right) \right\} \tag{B1}$$

and, using Bayes' theorem in the form $\Pr(x|y) \Pr(y) = \Pr(y|x) \Pr(x)$, this yields a matrix equation for the $x'(y)$

$$0 = \Pr(y) \left\{ n \int dx \Pr(x|y) \, x - (n-1) \sum_{y'=1}^{M} \left( \int dx \Pr(x|y) \Pr(y'|x) \right) x'(y') - x'(y) \right\} \tag{B2}$$

There are two classes of solution to this stationarity condition, corresponding to one (or more) of the two factors in
equation B2 being zero.


1. $\Pr(y) = 0$ (the first factor is zero). If the probability that neuron $y$ fires is zero, then nothing can be deduced
about $x'(y)$, because there is no training data to explore this neuron's behaviour.


2. $n \int dx \Pr(x|y) \, x = (n-1) \sum_{y'=1}^{M} \left( \int dx \Pr(x|y) \Pr(y'|x) \right) x'(y') + x'(y)$ (the second factor is zero). The solution
to this matrix equation is the required $x'(y)$.
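
The matrix equation in the second class of solution can be solved directly once the integrals are replaced by averages over training samples. The following is a minimal sketch under that assumption: `X` holds N sample input vectors, `P` holds the corresponding posterior probabilities Pr (y|x) as an N-by-M array, and the function name `reference_vectors` is hypothetical. The equation is multiplied through by Pr (y) so that only sample averages appear, and every neuron is assumed to have Pr (y) > 0.

```python
import numpy as np

def reference_vectors(X, P, n):
    """Solve n E[Pr(y|x) x] = (n-1) sum_y' E[Pr(y|x) Pr(y'|x)] x'(y') + Pr(y) x'(y)."""
    N = X.shape[0]
    prior = P.mean(axis=0)        # sample estimate of Pr(y)
    overlap = P.T @ P / N         # sample estimate of E[Pr(y|x) Pr(y'|x)]
    A = (n - 1) * overlap + np.diag(prior)
    b = n * (P.T @ X) / N
    return np.linalg.solve(A, b)  # one row per reference vector x'(y)

# Illustrative data: points on a circle with a hard (LBG-like) partition into 4 cells.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 2000)
X = np.column_stack([np.cos(theta), np.sin(theta)])
labels = np.floor(theta / (2.0 * np.pi) * 4).astype(int) % 4
P = np.eye(4)[labels]
refs_n1 = reference_vectors(X, P, n=1)
refs_n5 = reference_vectors(X, P, n=5)
centroids = np.array([X[labels == k].mean(axis=0) for k in range(4)])
```

For a hard partition the overlap matrix is diagonal, so the solution reduces to the cell centroids for any n, which is the LBG result; soft posteriors couple the reference vectors through the off-diagonal overlap terms.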


2. Stationary Pr (y|x)


The stationarity condition $\frac{\delta (D_1 + D_2)}{\delta \Pr(y|x)} = 0$ (with the normalisation constraint $\sum_{y'=1}^{M} \Pr(y'|x) = 1$) for $\Pr(y|x)$ will now
be derived. Thus functionally differentiate $D_1 + D_2$ with respect to $\log \Pr(y|x)$, where logarithmic differentiation
implicitly imposes the constraint $\Pr(y|x) \ge 0$, and use a Lagrange multiplier term $L = \int dx' \, \lambda(x') \sum_{y'=1}^{M} \Pr(y'|x')$
to impose the normalisation constraint $\sum_{y=1}^{M} \Pr(y|x) = 1$ for each $x$, to obtain

$$\frac{\delta (D_1 + D_2 - L)}{\delta \log \Pr(y|x)} = \frac{2}{n} \Pr(x) \Pr(y|x) \|x - x'(y)\|^2 - \frac{4(n-1)}{n} \Pr(x) \Pr(y|x) \, x'(y) \cdot \left( x - \sum_{y'=1}^{M} \Pr(y'|x) \, x'(y') \right) - \lambda(x) \Pr(y|x) \tag{B3}$$

The stationarity condition implies that $\sum_{y=1}^{M} \frac{\delta (D_1 + D_2 - L)}{\delta \log \Pr(y|x)} = 0$, which may be used to determine the Lagrange
multiplier function $\lambda(x)$. When $\lambda(x)$ is substituted back into the stationarity condition itself, it yields

$$0 = \Pr(x) \Pr(y|x) \sum_{y'=1}^{M} \left( \Pr(y'|x) - \delta_{y,y'} \right) x'(y') \cdot \left( \frac{1}{2} x'(y') - n x + (n-1) \sum_{y''=1}^{M} \Pr(y''|x) \, x'(y'') \right) \tag{B4}$$

There are several classes of solution to this stationarity
condition, corresponding to one (or more) of the three
factors in equation B4 being zero.


1. $\Pr(x) = 0$ (the first factor is zero). If the input
PDF is zero at $x$, then nothing can be deduced
about $\Pr(y|x)$, because there is no training data to
explore the network's behaviour at this point.


2. $\Pr(y|x) = 0$ (the second factor is zero). This factor
arises from the differentiation with respect to
$\log \Pr(y|x)$, and it ensures that $\Pr(y|x) < 0$ cannot
be attained. The singularity in $\log \Pr(y|x)$ when
$\Pr(y|x) = 0$ is what causes this solution to emerge.


3. $\sum_{y'=1}^{M} \left( \Pr(y'|x) - \delta_{y,y'} \right) x'(y') \cdot (\cdots) = 0$ (the
third factor is zero). The solution to this equation
is a $\Pr(y|x)$ that has a piecewise linear dependence
on $x$. This result can be seen to be intuitively
reasonable because $D_1 + D_2$ is of the form
$\int dx \Pr(x) f(x)$, where $f(x)$ is a linear combination
of terms of the form $x^i \Pr(y|x)^j$ (for $i = 0, 1, 2$
and $j = 0, 1, 2$), which is a quadratic form in $x$ (ignoring
the $x$-dependence of $\Pr(y|x)$). However, the
terms that appear in this linear combination are
such that a $\Pr(y|x)$ that is a piecewise linear function
of $x$ guarantees that $f(x)$ is a piecewise linear
combination of terms of the form $x^i$ (for $i = 0, 1, 2$),
which is a quadratic form in $x$ (the normalisation
constraint $\sum_{y=1}^{M} \Pr(y|x) = 1$ is used to remove a
contribution that is potentially quartic in $x$).
Thus a piecewise linear dependence of $\Pr(y|x)$ on
$x$ does not lead to any dependencies on $x$ that are
not already explicitly present in $D_1 + D_2$. The stationarity
condition on $\Pr(y|x)$ (see equation B4)
then imposes conditions on the allowed piecewise
linearities that $\Pr(y|x)$ can have.


Appendix C: Simplified Expressions for $D_1 + D_2$


The expressions for $D_1$ and $D_2$ (see equation 2.6) may be simplified in the case of joint encoding and factorial
encoding. The case of joint encoding is derived in appendix C1, and the case of factorial encoding is derived in
appendix C2. In both cases it is assumed that $x = (x_1, x_2)$ and $\Pr(x_1, x_2) = \Pr(x_1) \Pr(x_2)$ where $\Pr(x_1)$ and
$\Pr(x_2)$ each define a uniform PDF on the input manifold.


1. Joint Encoding 


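The expressions for $D_1$ and $D_2$ may be simplified in the case of joint encoding, where $x = (x_1, x_2)$, $y = (y_1, y_2)$ for
$1 \le y_1 \le \sqrt{M}$ and $1 \le y_2 \le \sqrt{M}$. In the following two derivations of the expressions for $D_1$ and $D_2$ the steps in the
derivation use exactly the same sequence of manipulations.

The expression for $D_1$ is

$$D_1 = \frac{2}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \sum_{y_1=1}^{\sqrt{M}} \sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|x_1, x_2) \left\| \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \begin{pmatrix} x_1'(y_1, y_2) \\ x_2'(y_1, y_2) \end{pmatrix} \right\|^2 \tag{C1}$$

The assumed properties of $\Pr(x_1, x_2)$ imply that $x_1'(y_1, y_2) = x_1'(y_1)$ and $x_2'(y_1, y_2) = x_2'(y_2)$, which gives

$$D_1 = \frac{2}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \sum_{y_1=1}^{\sqrt{M}} \sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|x_1, x_2) \left( \|x_1 - x_1'(y_1)\|^2 + \|x_2 - x_2'(y_2)\|^2 \right) \tag{C2}$$

Marginalise $\Pr(y_1, y_2|x_1, x_2)$ where possible, using that $\sum_{y_1=1}^{\sqrt{M}} \Pr(y_1, y_2|x_1, x_2) = \Pr(y_2|x_1, x_2) = \Pr(y_2|x_2)$ and
$\sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|x_1, x_2) = \Pr(y_1|x_1, x_2) = \Pr(y_1|x_1)$, to obtain

$$D_1 = \frac{2}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \left( \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1) \|x_1 - x_1'(y_1)\|^2 + \sum_{y_2=1}^{\sqrt{M}} \Pr(y_2|x_2) \|x_2 - x_2'(y_2)\|^2 \right) \tag{C3}$$

Marginalise $\Pr(x_1, x_2)$ where possible, using that $\int dx_1 \Pr(x_1, x_2) = \Pr(x_2)$ and $\int dx_2 \Pr(x_1, x_2) = \Pr(x_1)$, to
obtain

$$D_1 = \frac{2}{n} \int dx_1 \Pr(x_1) \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1) \|x_1 - x_1'(y_1)\|^2 + \frac{2}{n} \int dx_2 \Pr(x_2) \sum_{y_2=1}^{\sqrt{M}} \Pr(y_2|x_2) \|x_2 - x_2'(y_2)\|^2 \tag{C4}$$

Because of the assumed symmetry of the solution, these two terms are the same, which gives

$$D_1 = \frac{4}{n} \int dx_1 \Pr(x_1) \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1) \|x_1 - x_1'(y_1)\|^2 \tag{C5}$$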


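The expression for $D_2$ is

$$D_2 = \frac{2(n-1)}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \left\| \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \sum_{y_1=1}^{\sqrt{M}} \sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|x_1, x_2) \begin{pmatrix} x_1'(y_1, y_2) \\ x_2'(y_1, y_2) \end{pmatrix} \right\|^2 \tag{C6}$$

Use that $x_1'(y_1, y_2) = x_1'(y_1)$ and $x_2'(y_1, y_2) = x_2'(y_2)$.

$$D_2 = \frac{2(n-1)}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \left( \left\| x_1 - \sum_{y_1=1}^{\sqrt{M}} \sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|x_1, x_2) \, x_1'(y_1) \right\|^2 + \left\| x_2 - \sum_{y_1=1}^{\sqrt{M}} \sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|x_1, x_2) \, x_2'(y_2) \right\|^2 \right) \tag{C7}$$

Use that $\sum_{y_1=1}^{\sqrt{M}} \Pr(y_1, y_2|x_1, x_2) = \Pr(y_2|x_2)$ and $\sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|x_1, x_2) = \Pr(y_1|x_1)$.

$$D_2 = \frac{2(n-1)}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \left( \left\| x_1 - \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1) \, x_1'(y_1) \right\|^2 + \left\| x_2 - \sum_{y_2=1}^{\sqrt{M}} \Pr(y_2|x_2) \, x_2'(y_2) \right\|^2 \right) \tag{C8}$$

Marginalise $\Pr(x_1, x_2)$.

$$D_2 = \frac{2(n-1)}{n} \int dx_1 \Pr(x_1) \left\| x_1 - \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1) \, x_1'(y_1) \right\|^2 + \frac{2(n-1)}{n} \int dx_2 \Pr(x_2) \left\| x_2 - \sum_{y_2=1}^{\sqrt{M}} \Pr(y_2|x_2) \, x_2'(y_2) \right\|^2 \tag{C9}$$

Use symmetry.

$$D_2 = \frac{4(n-1)}{n} \int dx_1 \Pr(x_1) \left\| x_1 - \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1) \, x_1'(y_1) \right\|^2 \tag{C10}$$

These results may be combined to yield finally

$$D_1 + D_2 = \frac{4}{n} \int dx_1 \Pr(x_1) \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1) \|x_1 - x_1'(y_1)\|^2 + \frac{4(n-1)}{n} \int dx_1 \Pr(x_1) \left\| x_1 - \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1) \, x_1'(y_1) \right\|^2 \tag{C11}$$

which has the same form as $D_1 + D_2$ would have had for $x_1$-space alone, with the replacement $M \to \sqrt{M}$ followed by
multiplication by a factor 2 overall. This implies that the problem of optimising a joint encoder is trivially related to
the problem of optimising an encoder in the $x_1$-space alone.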


2. Factorial Encoding 


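The expressions for $D_1$ and $D_2$ may be simplified in the case of factorial encoding. In the following two derivations
of the expressions for $D_1$ and $D_2$, the steps in the derivation use exactly the same sequence of manipulations, except
that $D_2$ has one additional step which separates the contributions inside $\|\cdots\|^2$.

The expression for $D_1$ is

$$D_1 = \frac{2}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \sum_{y=1}^{M} \Pr(y|x_1, x_2) \left\| \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \begin{pmatrix} x_1'(y) \\ x_2'(y) \end{pmatrix} \right\|^2 \tag{C12}$$

Split up $\Pr(y|x_1, x_2)$, using that $\Pr(y|x_1, x_2) = \frac{1}{2} \Pr(y|x_1) + \frac{1}{2} \Pr(y|x_2)$, which gives

$$D_1 = \frac{1}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \sum_{y=1}^{M} \left( \Pr(y|x_1) + \Pr(y|x_2) \right) \left\| \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \begin{pmatrix} x_1'(y) \\ x_2'(y) \end{pmatrix} \right\|^2 \tag{C13}$$

Assume that the input manifold is such that $x_1'(y) = 0$ for $\frac{M}{2} + 1 \le y \le M$, and $x_2'(y) = 0$ for $1 \le y \le \frac{M}{2}$. Also use
that $\Pr(y|x_1) = 0$ for $\frac{M}{2} + 1 \le y \le M$, and $\Pr(y|x_2) = 0$ for $1 \le y \le \frac{M}{2}$, to obtain

$$\begin{aligned}
D_1 = {} & \frac{1}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \sum_{y=1}^{M/2} \Pr(y|x_1) \left\| \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \begin{pmatrix} x_1'(y) \\ 0 \end{pmatrix} \right\|^2 \\
& + \frac{1}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \sum_{y=M/2+1}^{M} \Pr(y|x_2) \left\| \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \begin{pmatrix} 0 \\ x_2'(y) \end{pmatrix} \right\|^2
\end{aligned} \tag{C14}$$

Because of the assumed symmetry of the solution, these two terms are the same, which gives

$$D_1 = \frac{2}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \sum_{y=1}^{M/2} \Pr(y|x_1) \left( \|x_1 - x_1'(y)\|^2 + \|x_2\|^2 \right) \tag{C15}$$

Marginalise $\Pr(x_1, x_2)$ where possible, using that $\int dx_1 \Pr(x_1, x_2) = \Pr(x_2)$ and $\int dx_2 \Pr(x_1, x_2) = \Pr(x_1)$, to
obtain

$$D_1 = \frac{2}{n} \int dx_1 \Pr(x_1) \sum_{y=1}^{M/2} \Pr(y|x_1) \|x_1 - x_1'(y)\|^2 + \frac{2}{n} \int dx_2 \Pr(x_2) \|x_2\|^2 \tag{C16}$$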
y=1 


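The expression for $D_2$ is

$$D_2 = \frac{2(n-1)}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \left\| \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \sum_{y=1}^{M} \Pr(y|x_1, x_2) \begin{pmatrix} x_1'(y) \\ x_2'(y) \end{pmatrix} \right\|^2 \tag{C17}$$

Use that $\Pr(y|x_1, x_2) = \frac{1}{2} \Pr(y|x_1) + \frac{1}{2} \Pr(y|x_2)$.

$$D_2 = \frac{2(n-1)}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \left\| \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \frac{1}{2} \sum_{y=1}^{M} \left( \Pr(y|x_1) + \Pr(y|x_2) \right) \begin{pmatrix} x_1'(y) \\ x_2'(y) \end{pmatrix} \right\|^2 \tag{C18}$$

Separate the contributions from the upper and lower components inside $\|\cdots\|^2$, to obtain

$$D_2 = \frac{2(n-1)}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \left\| \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \frac{1}{2} \sum_{y=1}^{M} \Pr(y|x_1) \begin{pmatrix} x_1'(y) \\ x_2'(y) \end{pmatrix} - \frac{1}{2} \sum_{y=1}^{M} \Pr(y|x_2) \begin{pmatrix} x_1'(y) \\ x_2'(y) \end{pmatrix} \right\|^2 \tag{C19}$$

Use that $x_1'(y) = 0$ for $\frac{M}{2} + 1 \le y \le M$, and $x_2'(y) = 0$ for $1 \le y \le \frac{M}{2}$. Also use that $\Pr(y|x_1) = 0$ for $\frac{M}{2} + 1 \le y \le M$,
and $\Pr(y|x_2) = 0$ for $1 \le y \le \frac{M}{2}$.

$$\begin{aligned}
D_2 = {} & \frac{2(n-1)}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \left\| x_1 - \frac{1}{2} \sum_{y=1}^{M/2} \Pr(y|x_1) \, x_1'(y) \right\|^2 \\
& + \frac{2(n-1)}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \left\| x_2 - \frac{1}{2} \sum_{y=M/2+1}^{M} \Pr(y|x_2) \, x_2'(y) \right\|^2
\end{aligned} \tag{C20}$$

Use symmetry.

$$D_2 = \frac{4(n-1)}{n} \int dx_1 dx_2 \Pr(x_1, x_2) \left\| x_1 - \frac{1}{2} \sum_{y=1}^{M/2} \Pr(y|x_1) \, x_1'(y) \right\|^2 \tag{C21}$$

Marginalise $\Pr(x_1, x_2)$.

$$D_2 = \frac{4(n-1)}{n} \int dx_1 \Pr(x_1) \left\| x_1 - \frac{1}{2} \sum_{y=1}^{M/2} \Pr(y|x_1) \, x_1'(y) \right\|^2 \tag{C22}$$

These results may be combined to yield finally

$$\begin{aligned}
D_1 + D_2 = {} & \frac{2}{n} \int dx_1 \Pr(x_1) \sum_{y=1}^{M/2} \Pr(y|x_1) \|x_1 - x_1'(y)\|^2 + \frac{4(n-1)}{n} \int dx_1 \Pr(x_1) \left\| x_1 - \frac{1}{2} \sum_{y=1}^{M/2} \Pr(y|x_1) \, x_1'(y) \right\|^2 \\
& + \frac{2}{n} \int dx_2 \Pr(x_2) \|x_2\|^2
\end{aligned} \tag{C23}$$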


The stationarity conditions may be derived from this expression for the factorial encoding version of $D_1 + D_2$. The
stationarity condition w.r.t. $\Pr(y|x_1)$ is

$$\sum_{y'=1}^{M/2} \left( \Pr(y'|x_1) - \delta_{y,y'} \right) x_1'(y') \cdot \left( \frac{1}{2} x_1'(y') - n x_1 + \frac{n-1}{2} \sum_{y''=1}^{M/2} \Pr(y''|x_1) \, x_1'(y'') \right) = 0 \tag{C24}$$

and the stationarity condition w.r.t. $x_1'(y)$ is

$$n \int dx_1 \Pr(x_1|y) \, x_1 = x_1'(y) + \frac{n-1}{2} \int dx_1 \Pr(x_1|y) \sum_{y'=1}^{M/2} \Pr(y'|x_1) \, x_1'(y') \tag{C25}$$

Both of these stationarity conditions can be obtained from the standard ones by making the replacements
$(n-1) \sum_{y'=1}^{M} \Pr(y'|x_1) \, x_1'(y') \to \frac{n-1}{2} \sum_{y'=1}^{M/2} \Pr(y'|x_1) \, x_1'(y')$ and $M \to \frac{M}{2}$.


Appendix D: Minimise $D_1 + D_2$


The expression for $D_1 + D_2$ needs to be minimised with respect to the reference vectors $x'(y)$ and the posterior
probabilities $\Pr(y|x)$. There are four cases to consider, which are various combinations of circular/toroidal input
manifold (appendices D1 and D2/appendices D3 and D4) and two/three overlapping posterior probabilities (appendices
D1 and D3/appendices D2 and D4). For a toroidal manifold it is not necessary to consider the case of joint
encoding, because it is directly related to encoding a circular manifold, which is dealt with in appendices D1 and D2.


1. Circular Manifold: 2 Overlapping Posterior Probabilities 


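For $0 \le s \le \frac{\pi}{M}$ the functional form of $p(\theta)$ that ensures a piecewise linear $\Pr(y|x)$ is

$$p(\theta) = \begin{cases} 1 & 0 \le |\theta| \le \frac{\pi}{M} - s \\ f(\theta) & \frac{\pi}{M} - s \le |\theta| \le \frac{\pi}{M} + s \\ 0 & |\theta| \ge \frac{\pi}{M} + s \end{cases} \tag{D1}$$

where $f(\theta) = a + b \cos\theta + c \sin|\theta|$. Continuity of $p(\theta)$ gives $f\left(\frac{\pi}{M} - s\right) = 1$ and $f\left(\frac{\pi}{M} + s\right) = 0$. Normalisation of $p(\theta)$
in the interval $\frac{\pi}{M} - s \le \theta \le \frac{\pi}{M} + s$ requires that $f(\theta) + f\left(\frac{2\pi}{M} - \theta\right) = 1$. These yield $f(\theta)$ in the form

$$f(\theta) = \frac{1}{2} + \frac{1}{2} \frac{\sin\left(\frac{\pi}{M} - \theta\right)}{\sin s} \tag{D2}$$

$D_1 + D_2$ must be stationary w.r.t. variation of $p(\theta)$ in the interval $\frac{\pi}{M} - s \le \theta \le \frac{\pi}{M} + s$, which yields the condition

$$\begin{aligned}
0 = {} & r \csc^2 s \, \sin\left(\frac{\pi}{M}\right) \sin\left(\frac{\pi}{M} - \theta\right) \left( \sin s - \sin\left(\frac{\pi}{M} - \theta\right) \right) \\
& \times \left( n \sin s - (n-1) \, r \sin\left(\frac{\pi}{M}\right) \right)
\end{aligned} \tag{D3}$$

$$0 = r \csc^2 s \, \sin\left(\frac{\pi}{M}\right) \sin\left(\frac{\pi}{M} - \theta\right) \left( \sin s - \sin\left(\frac{\pi}{M} - \theta\right) \right) \left( n \sin s - (n-1) \, r \sin\left(\frac{\pi}{M}\right) \right) \tag{D4}$$

which gives the optimum solution for $r$ as

$$r = \frac{n \sin s}{(n-1) \sin\left(\frac{\pi}{M}\right)} \tag{D5}$$

$D_1 + D_2$ must be stationary w.r.t. variation of $r$. This yields a transcendental equation that must be satisfied by the
optimum solution for $s$ as

$$\frac{\sin s}{\sin\left(\frac{\pi}{M}\right)} - \frac{n-1}{n} \frac{M}{\pi} \sin\left(\frac{\pi}{M}\right) \left( \cos s + s \sin s \right) = 0 \tag{D6}$$

$D_1$ and $D_2$ may be written out in full as (using $\mathbf{n}(\theta) = (\cos\theta, \sin\theta)$)

$$D_1 = \frac{2M}{n\pi} \left( \int_0^{\frac{\pi}{M} - s} d\theta \, \|\mathbf{n}(\theta) - r \mathbf{n}(0)\|^2 + \int_{\frac{\pi}{M} - s}^{\frac{\pi}{M}} d\theta \, f(\theta) \, \|\mathbf{n}(\theta) - r \mathbf{n}(0)\|^2 + \int_{\frac{\pi}{M} - s}^{\frac{\pi}{M}} d\theta \, f\left(\frac{2\pi}{M} - \theta\right) \left\| \mathbf{n}(\theta) - r \mathbf{n}\left(\frac{2\pi}{M}\right) \right\|^2 \right) \tag{D7}$$

$$D_2 = \frac{2(n-1)M}{n\pi} \left( \int_0^{\frac{\pi}{M} - s} d\theta \, \|\mathbf{n}(\theta) - r \mathbf{n}(0)\|^2 + \int_{\frac{\pi}{M} - s}^{\frac{\pi}{M}} d\theta \, \left\| \mathbf{n}(\theta) - r f(\theta) \, \mathbf{n}(0) - r f\left(\frac{2\pi}{M} - \theta\right) \mathbf{n}\left(\frac{2\pi}{M}\right) \right\|^2 \right) \tag{D8}$$

The optimum $f(\theta)$ and $r$ may be substituted into $D_1 + D_2$, the integrations evaluated, and then the condition that
the optimum $s$ must satisfy may be used to simplify the result, to yield the minimum $D_1 + D_2$ as

$$D_1 + D_2 = 2 - \frac{n}{n-1} \frac{M}{2\pi} \left( 2s + \sin(2s) \right) \tag{D9}$$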
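
The two-overlap circular solution can be checked numerically. The following is a minimal sketch, assuming the forms $f(\theta) = \frac{1}{2} + \frac{1}{2} \sin(\frac{\pi}{M} - \theta) / \sin s$, $r = n \sin s / ((n-1) \sin\frac{\pi}{M})$, the stationarity condition $\sin s / \sin\frac{\pi}{M} - \frac{n-1}{n} \frac{M}{\pi} \sin\frac{\pi}{M} (\cos s + s \sin s) = 0$, and the minimum $D_1 + D_2 = 2 - \frac{n}{n-1} \frac{M}{2\pi} (2s + \sin 2s)$. The function names are hypothetical, the root is found by plain bisection, and the closed-form minimum is compared with a direct midpoint-rule evaluation of $D_1 + D_2$ on the circle.

```python
import math

def residual(s, n, M):
    """Left-hand side of the transcendental condition on s."""
    a = math.pi / M
    return (math.sin(s) / math.sin(a)
            - (n - 1) / n * (M / math.pi) * math.sin(a)
            * (math.cos(s) + s * math.sin(s)))

def solve_s(n, M, iterations=200):
    """Bisection on 0 <= s <= pi/M; raises if the two-overlap form does not apply."""
    if n < 2:
        raise ValueError("n must be at least 2")
    lo, hi = 0.0, math.pi / M
    if residual(lo, n, M) * residual(hi, n, M) > 0:
        raise ValueError("no root with 0 <= s <= pi/M")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if residual(lo, n, M) * residual(mid, n, M) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def optimum(n, M):
    """Return (s, r, closed-form minimum of D1 + D2)."""
    s = solve_s(n, M)
    r = n * math.sin(s) / ((n - 1) * math.sin(math.pi / M))
    d_min = 2 - n / (n - 1) * M / (2 * math.pi) * (2 * s + math.sin(2 * s))
    return s, r, d_min

def posterior(phi, s, M):
    """Piecewise p(phi) for the two-overlap case."""
    a = math.pi / M
    phi = abs(phi)
    if phi <= a - s:
        return 1.0
    if phi >= a + s:
        return 0.0
    return 0.5 + 0.5 * math.sin(a - phi) / math.sin(s)

def numeric_d1_plus_d2(s, r, n, M, steps=4000):
    """Midpoint-rule evaluation of D1 + D2 for input uniform on the unit circle."""
    sum1 = sum2 = 0.0
    for k in range(steps):
        theta = 2 * math.pi * (k + 0.5) / steps
        x, y = math.cos(theta), math.sin(theta)
        mx = my = 0.0
        for j in range(M):
            centre = 2 * math.pi * j / M
            phi = (theta - centre + math.pi) % (2 * math.pi) - math.pi
            p = posterior(phi, s, M)
            cx, cy = r * math.cos(centre), r * math.sin(centre)
            sum1 += p * ((x - cx) ** 2 + (y - cy) ** 2)
            mx += p * cx
            my += p * cy
        sum2 += (x - mx) ** 2 + (y - my) ** 2
    return 2 / n * sum1 / steps + 2 * (n - 1) / n * sum2 / steps

s, r, d_min = optimum(n=2, M=4)
d_numeric = numeric_d1_plus_d2(s, r, n=2, M=4)
```

When n is large for a given M the bisection bracket contains no root, which signals that the two-overlap assumption has broken down and the three-overlap case is needed instead.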


2. Circular Manifold: 3 Overlapping Posterior Probabilities 


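For $\frac{\pi}{M} \le s \le \frac{2\pi}{M}$ the functional form of $p(\theta)$ that ensures a piecewise linear $\Pr(y|x)$ is

$$p(\theta) = \begin{cases} f_1(\theta) & 0 \le |\theta| \le -\frac{\pi}{M} + s \\ f_2(\theta) & -\frac{\pi}{M} + s \le |\theta| \le \frac{3\pi}{M} - s \\ f_3(\theta) & \frac{3\pi}{M} - s \le |\theta| \le \frac{\pi}{M} + s \\ 0 & |\theta| \ge \frac{\pi}{M} + s \end{cases} \tag{D10}$$

where $f_i(\theta) = a_i + b_i \cos\theta + c_i \sin|\theta|$ for $i = 1, 2, 3$. Continuity of $p(\theta)$ gives $f_1\left(-\frac{\pi}{M} + s\right) = f_2\left(-\frac{\pi}{M} + s\right)$,
$f_2\left(\frac{3\pi}{M} - s\right) = f_3\left(\frac{3\pi}{M} - s\right)$ and $f_3\left(\frac{\pi}{M} + s\right) = 0$. Normalisation of $p(\theta)$ in the interval $0 \le \theta \le -\frac{\pi}{M} + s$ requires
that $f_1(\theta) + f_3\left(\frac{2\pi}{M} + \theta\right) + f_3\left(\frac{2\pi}{M} - \theta\right) = 1$, and normalisation of $p(\theta)$ in the interval $-\frac{\pi}{M} + s \le \theta \le \frac{3\pi}{M} - s$ requires
that $f_2(\theta) + f_2\left(\frac{2\pi}{M} - \theta\right) = 1$. These conditions may be used to eliminate all but a pair of parameters in the $f_i(\theta)$,
which may thus be written in the form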


fi (0) = ; cos (0) sec (= - s) +a, (1 — cos (6) sec (= - s)) + bz cos (6) csc (=) sin (3 - :) sec (= - s) 


f2(9) = ; + be (cos (0) — cot (=) sin (0) 


f3(0) = 5 (1 cso (77 25) sin (7 3 6)) ! m1 (cos ( 35-8) sec (77-8) -1) 


+b, ese (>) ae ere ae in(— +5-6) (D1) 

2 esc | 77) esc | 77 — 28) sin| 77 —s]} sin( 47 +5 — 
$D_1 + D_2$ must be stationary w.r.t. variation of $p(\theta)$ in each of the 3 intervals $0 \le \theta \le -\frac{\pi}{M} + s$ (interval 1),
$-\frac{\pi}{M} + s \le \theta \le \frac{3\pi}{M} - s$ (interval 2), and $\frac{3\pi}{M} - s \le \theta \le \frac{\pi}{M} + s$ (interval 3). The Fourier transform w.r.t. $\theta$ of each of
these 3 stationarity conditions has 5 terms with basis functions $(1, \cos\theta, \sin\theta, \cos 2\theta, \sin 2\theta)$, and each of the total of
15 Fourier coefficients must be zero. There are only 3 free parameters $a_1$, $b_2$ and $r$, so only 3 of the 15 are actually
independent; the particular 3 that are used are selected on the basis of ease of solution for the free parameters $a_1$, $b_2$
and $r$. The coefficient of the $\cos 2\theta$ term in interval 2 yields

$$b_2 r \left( n + 2 b_2 r - 2 b_2 r n \right) \cos\left(\frac{2\pi}{M}\right) = 0 \tag{D12}$$

which has the solution

$$b_2 = \frac{n}{2(n-1) r} \tag{D13}$$


which may be substituted back into the coefficient of the $\cos\theta$ term in interval 1 to yield


0 = rsec (= — s) sin (=) 


x (( 1) 6a 2) rsin () +(n—1) (2a? —3a; +1) rsin (FF) 


ini aa) sean ea) bis 
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and also substituted back into the coefficient of the sin term in interval 3 to yield 


0 = rcos (=) csc (= — s) sec (= - s) sin? (=) 


—2a1 (3a, — 2) cos (35 = s) 
—2 (a, —1) a, cos (3 + s) 
—(n-l)r + (1 — 2a, +243) cos (57 — 8) 
: ae (1 = Aa, + ai) cos (s) (D15) 


sn (148) 0m i) = (r= Deo) 


—2a, sin (4) sin (44 _ 


These two conditions may be solved for a; and r to yield 


27 
COS | => 
a, = 2a Gn) (D16) 
cos (34 —1 
and 
ay SE all 3eC ( us ) (D17) 
oS pene WE ae 


The solutions for $a_1$ and $b_2$ may be substituted back into the expressions for the $f_i(\theta)$ to reduce them to the form


fi(@) = -5 (cos (3 :) + coss — 2 cos () cos6) esc? (=) sec (= -s) 


f2(9) = ; (cot (=) sec (3 - :) sin (= - 0) ~ 1) 
fa (0) = —3 ese? (=) (cos (3 a a) sec (3 s) = 1) (D18) 


D, + D2 must be stationary w.r.t. variation of r. This yields a transcendental equation that must be satisfied by the 
optimum solution for s as 


pa eon Ge) (Gros) -Gr-s)*(G-))=2 om 


M 


Dy, and Dz may be written out in full as 


fi (8) (In (@) — rn (0)|I° 


fo *** a9 +s (3 — 9) [In @) — rm (3A) If 
+fs (22 +8) |fn(@) —rn (—25) | 
_ Mt an _s fo (0) ||n (0) — rn (0)||? 
OF na ea + fa (35 ~8) |[n(6) — rn (35)! ae) 


ie fs (0) |n (0) — rn (0)||° : 
+ fe ao +f 3F ~8) |In(@) ~ rn (35)| 
+fs (4 —8) \hn(6) — rn (42)| 


pote ap n (9) — fi (9) i (2 Nees 
MM coo Fes a rn pus 
cals "fF +8) rm (35) 
at %-s yg || 2) — fe @) rn(0) 
Dog nT + fe dg a (22 - ) rn (27) | (D21) 
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25 

The optimum f; (9) and r may be substituted into D, + Dz, the integrations evaluated, and then the condition that 
the optimum s must satisfy may be used to simplify the result, to yield the minimum D, + D2 as 
nC Ty (222 Ms 
bya Dy a R= DAM? Bs) 


7 


Dinky 


sec? (*)) n((n—1) (2 2) sec” (4)) oe (4 


7 25) (D22) 


3. Toroidal Manifold: 2 Overlapping Posterior Probabilities 


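For $0 \le s \le \frac{2\pi}{M}$ the functional form of $p(\theta)$ may be obtained directly from the circular case with the replacement
$M \to \frac{M}{2}$, so that

$$p(\theta) = \begin{cases} 1 & 0 \le |\theta| \le \frac{2\pi}{M} - s \\ f(\theta) & \frac{2\pi}{M} - s \le |\theta| \le \frac{2\pi}{M} + s \\ 0 & |\theta| \ge \frac{2\pi}{M} + s \end{cases} \tag{D23}$$

$$f(\theta) = \frac{1}{2} + \frac{1}{2} \frac{\sin\left(\frac{2\pi}{M} - \theta\right)}{\sin s} \tag{D24}$$

$D_1 + D_2$ must be stationary w.r.t. variation of $p(\theta)$ in the interval $\frac{2\pi}{M} - s \le \theta \le \frac{2\pi}{M} + s$, which yields the condition

$$0 = r \csc^2 s \, \sin\left(\frac{2\pi}{M}\right) \sin\left(\frac{2\pi}{M} - \theta\right) \left( \sin s - \sin\left(\frac{2\pi}{M} - \theta\right) \right) \left( 2n \sin s - (n-1) \, r \sin\left(\frac{2\pi}{M}\right) \right) \tag{D25}$$

which has the same form as the circular case with the replacements $M \to \frac{M}{2}$ and $n \to \frac{2n}{n+1}$, which gives the optimum
solution for $r$ as

$$r = \frac{2n \sin s}{(n-1) \sin\left(\frac{2\pi}{M}\right)} \tag{D26}$$

$D_1 + D_2$ must be stationary w.r.t. variation of $r$. This yields a transcendental equation that must be satisfied by the
optimum solution for $s$ as

$$\frac{\sin s}{\sin\left(\frac{2\pi}{M}\right)} - \frac{n-1}{n+1} \frac{M}{2\pi} \sin\left(\frac{2\pi}{M}\right) \left( \cos s + s \sin s \right) = 0 \tag{D27}$$

which has the same form as the circular case with the replacements $M \to \frac{M}{2}$ and $n \to \frac{n+1}{2}$. $D_1$ and $D_2$ may be
written out in full as

$$D_1 = \frac{M}{n\pi} \left( \int_0^{\frac{2\pi}{M} - s} d\theta \left( 1 + \|\mathbf{n}(\theta) - r \mathbf{n}(0)\|^2 \right) + \int_{\frac{2\pi}{M} - s}^{\frac{2\pi}{M}} d\theta \, f(\theta) \left( 1 + \|\mathbf{n}(\theta) - r \mathbf{n}(0)\|^2 \right) + \int_{\frac{2\pi}{M} - s}^{\frac{2\pi}{M}} d\theta \, f\left(\frac{4\pi}{M} - \theta\right) \left( 1 + \left\| \mathbf{n}(\theta) - r \mathbf{n}\left(\frac{4\pi}{M}\right) \right\|^2 \right) \right) \tag{D28}$$

$$D_2 = \frac{2(n-1)M}{n\pi} \left( \int_0^{\frac{2\pi}{M} - s} d\theta \, \left\| \mathbf{n}(\theta) - \frac{1}{2} r \mathbf{n}(0) \right\|^2 + \int_{\frac{2\pi}{M} - s}^{\frac{2\pi}{M}} d\theta \, \left\| \mathbf{n}(\theta) - \frac{1}{2} r f(\theta) \, \mathbf{n}(0) - \frac{1}{2} r f\left(\frac{4\pi}{M} - \theta\right) \mathbf{n}\left(\frac{4\pi}{M}\right) \right\|^2 \right) \tag{D29}$$

The optimum $f(\theta)$ and $r$ may be substituted into $D_1 + D_2$, the integrations evaluated, and then the condition that
the optimum $s$ must satisfy may be used to simplify the result, to yield the minimum $D_1 + D_2$ as

$$D_1 + D_2 = 4 - \frac{n}{n-1} \frac{M}{2\pi} \left( 2s + \sin(2s) \right) \tag{D30}$$

which has the same form as the circular case plus an extra contribution of 2.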
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4. Toroidal Manifold: 3 Overlapping Posterior Probabilities 


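For $\frac{2\pi}{M} \le s \le \frac{4\pi}{M}$ the functional form of $p(\theta)$ may be obtained directly from the circular case with the replacement
$M \to \frac{M}{2}$, so that

$$p(\theta) = \begin{cases} f_1(\theta) & 0 \le |\theta| \le -\frac{2\pi}{M} + s \\ f_2(\theta) & -\frac{2\pi}{M} + s \le |\theta| \le \frac{6\pi}{M} - s \\ f_3(\theta) & \frac{6\pi}{M} - s \le |\theta| \le \frac{2\pi}{M} + s \\ 0 & |\theta| \ge \frac{2\pi}{M} + s \end{cases} \tag{D31}$$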


fr (0) = } cos(@) seo (27 — 5) + a1 (1 ~cos(6) soc (3% 5) + bs cos (0) exe (25) sin (4 5) soc (3% ~s) 


fo(0) = ste (cos (0) — cot (FF) sin ()) 
fa(0) = + (1 ose (F 25) sin (SF — 5-0) ) +50 (cos ($7 -6) see (37-5) -1) 
sce 2) se a Balan ee) (om) 


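$D_1 + D_2$ must be stationary w.r.t. variation of $p(\theta)$ in each of the 3 intervals $0 \le \theta \le -\frac{2\pi}{M} + s$ (interval 1),
$-\frac{2\pi}{M} + s \le \theta \le \frac{6\pi}{M} - s$ (interval 2), and $\frac{6\pi}{M} - s \le \theta \le \frac{2\pi}{M} + s$ (interval 3). The coefficient of the $\cos 2\theta$ term in interval
2 yields

$$b_2 r \left( n + b_2 r - b_2 r n \right) \cos\left(\frac{4\pi}{M}\right) = 0 \tag{D33}$$

which has the same form as the circular case with the replacements $M \to \frac{M}{2}$ and $n \to \frac{2n}{n+1}$, which has the solution

$$b_2 = \frac{n}{(n-1) r} \tag{D34}$$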


which may be substituted back into the coefficient of the $\cos\theta$ term in interval 1 to yield


(n — 1) (—6at + 7a, — 2) rsin (37) 
0 =r sec (Ft -°) sin (3) +(n—1) (gh aa. 4 1) rsin ($2) (D35) 
+2n (a, sin (47 - s) +(1— a) sin (54 = s)) 


and also substituted back into the coefficient of the sin term in interval 3 to yield 


0 = 27 : 20 ; 27 . 9 f 20 
= 1r cos i csc Vi s} sec Vi s) sin Vi 


—2 a (3a; — 2) cos (33 —s 
—2(a,-1) a cos (4% +s) 
—(n-1)r + (1—2a; +243) cos (8% — s 


x + (1-44; + C cos (s) (D36) 


+2n ( (a1 + 1) cos (7) — (a1 — 1) cos (ar) ) 


—2a, sin (35) sin (34 _ 2s) 


both of which have the same form as the circular case with the replacements M —> u and n > as These two 
conditions may be solved for a, and r to yield 
An 
cos (54 
qe a (p37) 
cos (47) — 1 
and 
2n 4a 20 
r= ——; cos (4 :) sec (37) (D38) 
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The solutions for a, and bz may be substituted back into the expressions for the f; (@) to reduce them to the form 


fi(@) = -4 (cos (Fi - 5) + cos s — 2cos (57) cos8) esc? (Fr) sec & = s) 
Bion ese oad @ 2 ca 0) a3) 
f3(0) = =F esc” (3) (cos (3 - 0) sec (3 7 s) 7 1) (D39) 


which have the same form as the circular case with the replacement M —> oe D, + Dz must be stationary w.r.t. 
variation of r. This yields a transcendental equation that must be satisfied by the optimum solution for s as 


n-1M 20 os An An 3 At _¢ 
5 On On sin| 47 —8 um *} Ol ag 8 = 


which has the same form as the circular case with the replacements M —> a and n > nth 


written out in full as 


1 cos (47 — s) 


n cos (32) 


(D40) 


D, and Dy may be 


fr (8) (1+ [ln (0) = rn (0)|7) 
fa (44 — 8) (1+ |]p(@) — rn (47)||”) 
+fs (47 +8) (14 |]n(@) — rn (-49)|°) 


fg oe 


M as ,, { f2() (1+ IIm(@) -rn)I!”) 
dD, = >— + f%e a0 4m 4m) ||? 
2n7t at + fo (45 - ) (1+ In(@) -rn(4)| ) 
fa (0) (1+ [ln (0) — rn (0)|”) 
+ foe, a9 fi (44 — 8) (1+ [|p (@) — rn (47)||”) 
+fs (3 — 8) (1+ |p @) — rn (§7)|]”) 
(D41) 
on, | 2 -3h@) rn : 
ig? a 3 fa (49 - 8) rn (44) 
Pe —3 fs (a7 + 8) rn (—47) 
a (os Gn n(0) — 5 fo(@) rn (0 
Dy = + fs; 1 “1 fy (48-6) on (48) | (D42) 


-3 fs (9 8) rm (3) 


The optimum /f; (9) and r may be substituted into D, + Dz, the integrations evaluated, and then the condition that 
the optimum s must satisfy may be used to simplify the result, to yield the minimum Dj, + D2 as 


Ae ORE Se aera) (Gy Oa ea rie a) 


(n—1)° (n—1)° 


Di + D2= 


25) (D43) 
Self-Organising Stochastic Encoders


S P Luttrell 
DERA, Malvern 


The processing of mega-dimensional data, such as images, scales linearly with image size only if 
fixed size processing windows are used. It would be very useful to be able to automate the process 
of sizing and interconnecting the processing windows. A stochastic encoder that is an extension of 
the standard Linde-Buzo-Gray vector quantiser, called a stochastic vector quantiser (SVQ), includes 
this required behaviour amongst its emergent properties, because it automatically splits the input 
space into statistically independent subspaces, which it then separately encodes. 

Various optimal SVQs have been obtained, both analytically and numerically. Analytic solutions 
which demonstrate how the input space is split into independent subspaces may be obtained when 
an SVQ is used to encode data that lives on a 2-torus (e.g. the superposition of a pair of uncorrelated 
sinusoids). Many numerical solutions have also been obtained, using both SVQs and chains of linked 
SVQs: (1) images of multiple independent targets (encoders for single targets emerge), (2) images 
of multiple correlated targets (various types of encoder for single and multiple targets emerge), (3) 
superpositions of various waveforms (encoders for the separate waveforms emerge - this is a type of 
independent component analysis (ICA)), (4) maternal and foetal ECGs (another example of ICA), 
(5) images of textures (orientation maps and dominance stripes emerge). 

Overall, SVQs exhibit a rich variety of self-organising behaviour, which effectively discovers the 
internal structure of the training data. This should have an immediate impact on “intelligent” 
computation, because it reduces the need for expert human intervention in the design of data
processing algorithms.


I. STOCHASTIC VECTOR QUANTISER 


A. Reference 


Luttrell S P, 1997, Mathematics of Neural Networks: 
Models, Algorithms and Applications, Kluwer, Ellacott 
S W, Mason J C and Anderson I J (eds.), A theory of 
self-organising neural networks, 240-244. 


B. Objective Function 


1. Mean Euclidean Distortion


$$D = \int dx \Pr(x) \sum_{y} \Pr(y|x) \int dx' \Pr(x'|y) \|x - x'\|^2 \tag{1.1}$$

• Encode then decode: $x \to y \to x'$.

• $x$ = input vector; $y$ = code; $x'$ = reconstructed
vector.

• Code vector $y = (y_1, y_2, \cdots, y_n)$ for $1 \le y_i \le M$.

• $\Pr(x)$ = input PDF; $\Pr(y|x)$ = stochastic encoder;
$\Pr(x'|y)$ = stochastic decoder.

• $\|x - x'\|^2$ = Euclidean reconstruction error.


*Typeset in LaTeX on May 9, 2019.

This is archived at http://arxiv.org/abs/1012.4126, 18 Dec 2010.
This unpublished talk was given at the Workshop on Self-Organising
Systems - Future Prospects for Computing, 28-29 October
1999, Manchester, UK.


2. Simplify


$$\begin{aligned}
D &= 2 \int dx \Pr(x) \sum_{y} \Pr(y|x) \|x - x'(y)\|^2 \\
x'(y) &= \int dx \Pr(x|y) \, x = \frac{\int dx \Pr(x) \Pr(y|x) \, x}{\int dx \Pr(x) \Pr(y|x)}
\end{aligned} \tag{1.2}$$

• Do the $\int dx' \Pr(x'|y) (\cdots)$ integration.

• $x'(y)$ = reconstruction vector.

• $x'(y)$ is the solution of $\frac{\partial D}{\partial x'(y)} = 0$, so it can be
deduced by optimisation.


3. Constrain


$$\begin{aligned}
\Pr(y|x) &= \Pr(y_1|x) \Pr(y_2|x) \cdots \Pr(y_n|x) \\
x'(y) &= \frac{1}{n} \sum_{i=1}^{n} x'(y_i)
\end{aligned} \tag{1.3}$$

• $\Pr(y|x)$ implies the components $(y_1, y_2, \cdots, y_n)$
of $y$ are conditionally independent given $x$.

• $x'(y)$ implies the reconstruction is a superposition
of contributions $x'(y_i)$ for $i = 1, 2, \cdots, n$.

• The stochastic encoder samples $n$ times from the
same $\Pr(y|x)$.

4. Upper Bound


$$\begin{aligned}
D &\le D_1 + D_2 \\
D_1 &= \frac{2}{n} \int dx \Pr(x) \sum_{y=1}^{M} \Pr(y|x) \|x - x'(y)\|^2 \\
D_2 &= \frac{2(n-1)}{n} \int dx \Pr(x) \left\| x - \sum_{y=1}^{M} \Pr(y|x) \, x'(y) \right\|^2
\end{aligned} \tag{1.4}$$

• $D_1$ is a stochastic vector quantiser with the vector
code $y$ replaced by a scalar code $y$.

• $D_2$ is a non-linear (note $\Pr(y|x)$) encoder with a
superposition term $\sum_{y=1}^{M} \Pr(y|x) \, x'(y)$.

• $n \to \infty$: the stochastic encoder measures $\Pr(y|x)$
accurately and $D_2$ dominates.

• $n \to 1$: the stochastic encoder samples $\Pr(y|x)$
poorly and $D_1$ dominates.
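
The two terms of the upper bound can be estimated from samples. The following is a minimal sketch, assuming the integral over Pr (x) is replaced by an average over N sample vectors `X`, with `P` an N-by-M array of posteriors and `refs` an M-by-d array of reference vectors; the function name `upper_bound_terms` is hypothetical.

```python
import numpy as np

def upper_bound_terms(X, P, refs, n):
    """Sample estimates of D1 and D2 for n firing events."""
    # Squared distances ||x - x'(y)||^2 for every sample and neuron, shape (N, M).
    sq = ((X[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    d1 = 2.0 / n * (P * sq).sum(axis=1).mean()
    recon = P @ refs  # sum_y Pr(y|x) x'(y) for each sample
    d2 = 2.0 * (n - 1) / n * ((X - recon) ** 2).sum(axis=1).mean()
    return d1, d2

# Illustrative data: points on a circle, hard assignment to 4 cells.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 500)
X = np.column_stack([np.cos(theta), np.sin(theta)])
labels = np.floor(theta / (2.0 * np.pi) * 4).astype(int) % 4
P = np.eye(4)[labels]
refs = np.array([X[labels == k].mean(axis=0) for k in range(4)])
d1_n1, d2_n1 = upper_bound_terms(X, P, refs, n=1)
d1_n5, d2_n5 = upper_bound_terms(X, P, refs, n=5)
```

For a hard assignment the two terms measure the same squared error, so their sum does not depend on n; the weighting between them only matters once the posteriors overlap.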


II. ANALYTIC OPTIMISATION 
A. References 


Luttrell S P, 1999, Combining Artificial Neural Nets: 
Ensemble and Modular Multi-Net Systems, 235-263, 
Springer-Verlag, Sharkey A J C (ed.), Self-organised 
modular neural networks for encoding data. 


Luttrell S P, 1999, An Adaptive Network For Encod- 
ing Data Using Piecewise Linear Functions, Proceedings 
of the 9th International Conference on Artificial Neural 
Networks (ICANN99), Edinburgh, 7-10 September 1999, 
198-203. 


B. Stationarity Conditions 


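1. Stationarity w.r.t. x' (y)


$$n \int dx \Pr(x|y) \, x = x'(y) + (n-1) \int dx \Pr(x|y) \sum_{y'=1}^{M} \Pr(y'|x) \, x'(y') \tag{2.1}$$

• Stationarity condition is $\frac{\partial (D_1 + D_2)}{\partial x'(y)} = 0$.

2. Stationarity w.r.t. Pr (y|x)

$$\Pr(x) \Pr(y|x) \sum_{y'=1}^{M} \left( \Pr(y'|x) - \delta_{y,y'} \right) x'(y') \cdot \left( \frac{1}{2} x'(y') - n x + (n-1) \sum_{y''=1}^{M} \Pr(y''|x) \, x'(y'') \right) = 0 \tag{2.2}$$

• Stationarity condition is $\frac{\delta (D_1 + D_2)}{\delta \log \Pr(y|x)} = 0$, subject to $\sum_{y=1}^{M} \Pr(y|x) = 1$.

• 3 types of solution: $\Pr(x) = 0$ (trivial), $\Pr(y|x) = 0$ (ensures $\Pr(y|x) \ge 0$), and
$\sum_{y'=1}^{M} \left( \Pr(y'|x) - \delta_{y,y'} \right) x'(y') \cdot (\cdots) = 0$.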
C. Circle


1. Input vector uniformly distributed on a circle


$$x = (\cos\theta, \sin\theta)$$

$$\int dx\,\Pr(x)\,(\cdots) = \frac{1}{2\pi}\int_{0}^{2\pi} d\theta\,(\cdots) \tag{2.3}$$


2. Stochastic encoder PDFs symmetrically arranged around the circle


$$\Pr(y|\theta) = p\left(\theta - \frac{2\pi y}{M}\right) \tag{2.4}$$


3. Reconstruction vectors symmetrically arranged around the circle


$$x'(y) = r\left(\cos\frac{2\pi y}{M}, \sin\frac{2\pi y}{M}\right) \tag{2.5}$$


4. Stochastic encoder PDFs overlap no more than 2 at a time


5. Stochastic encoder PDFs overlap no more than 3 at a time


D. 2-Torus


1. Input vector uniformly distributed on a 2-torus


$$x = (x_1, x_2), \qquad x_1 = (\cos\theta_1, \sin\theta_1), \qquad x_2 = (\cos\theta_2, \sin\theta_2)$$

$$\int dx\,\Pr(x)\,(\cdots) = \frac{1}{4\pi^2}\int_{0}^{2\pi} d\theta_1\int_{0}^{2\pi} d\theta_2\,(\cdots) \tag{2.8}$$


2. Joint encoding


$$\Pr(y|x) = \Pr(y|x_1, x_2) \tag{2.9}$$

• $\Pr(y|x)$ depends jointly on $x_1$ and $x_2$.

• Requires n = 1 to encode x.

• For a given resolution the size of the codebook increases exponentially with input dimension.


3. Factorial encoding


$$\Pr(y|x) = \begin{cases}\Pr(y|x_1) & y \in Y_1 \\ \Pr(y|x_2) & y \in Y_2\end{cases} \tag{2.10}$$

• $Y_1$ and $Y_2$ are non-intersecting subsets of the allowed values of y.

• $\Pr(y|x)$ depends either on $x_1$ or on $x_2$, but not on both at the same time.

• Requires n >> 1 to encode x.

• For a given resolution the size of the codebook increases linearly with input dimension.
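The two scaling statements can be illustrated with a small arithmetic sketch. It assumes, purely for illustration, that a given resolution needs a fixed number of code indices per intrinsic dimension; the function names and the `levels_per_dim` parameter are hypothetical and do not come from the talk.

```python
def joint_codebook_size(levels_per_dim, dims):
    """Joint encoding: one code index per cell of the full grid (exponential)."""
    return levels_per_dim ** dims


def factorial_codebook_size(levels_per_dim, dims):
    """Factorial encoding: a separate set of code indices per dimension (linear)."""
    return levels_per_dim * dims


# Assumed resolution of 8 code indices per intrinsic dimension.
for dims in (1, 2, 3, 4):
    print(dims, joint_codebook_size(8, dims), factorial_codebook_size(8, dims))
```

For the 2-torus (2 intrinsic dimensions) this toy count already gives 64 code indices for a joint encoder against 16 for a factorial encoder.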
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4. Stability diagram 


• Fixed n, increasing M: joint encoding is eventually favoured because the size of the codebook is eventually large enough.

• Fixed M, increasing n: factorial encoding is eventually favoured because the number of samples is eventually large enough.

• Factorial encoding is encouraged by using a small codebook and sampling a large number of times.

[Figure: stability diagram, with regions labelled "joint encoding" and "factorial encoding".]


III. NUMERICAL OPTIMISATION
A. References


Luttrell S P, 1997, to appear in Proceedings of the
Conference on Information Theory and the Brain,
Newquay, 20-21 September 1996, The emergence of
dominance stripes and orientation maps in a network of
firing neurons.


Luttrell S P, 1997, Mathematics of Neural Networks:
Models, Algorithms and Applications, Kluwer, Ellacott
S W, Mason J C and Anderson I J (eds.), A theory of
self-organising neural networks, 240-244.


Luttrell S P, 1999, submitted to a special issue of IEEE
Trans. Information Theory on Information-Theoretic
Imaging, Stochastic vector quantisers.


B. Gradient Descent


1. Posterior probability with infinite range neighbourhood


$$\Pr(y|x) = \frac{Q(x|y)}{\sum_{y'=1}^{M} Q(x|y')}$$

• $Q(x|y) > 0$ is needed to ensure a valid $\Pr(y|x)$.

• This does not restrict $\Pr(y|x)$ in any way.


2. Posterior probability with finite range neighbourhood


$$\Pr(y|x; y') = \frac{Q(x|y)\,\delta_{y \in N(y')}}{\sum_{y'' \in N(y')} Q(x|y'')} \tag{3.1}$$

$$\Pr(y|x) = \frac{1}{M}\sum_{y' \in N^{-1}(y)} \Pr(y|x; y') \tag{3.2}$$

• $N(y')$ is the set of neurons that lie in a predefined “neighbourhood” of $y'$.

• $N^{-1}(y)$ is the “inverse neighbourhood” of $y$ defined as $N^{-1}(y) = \{y' : y \in N(y')\}$.

• Neighbourhood is used to introduce “lateral inhibition” between the firing neurons.

• This restricts $\Pr(y|x)$, but allows limited range lateral interactions to be used.


3. Probability leakage


$$\Pr(y|x) \longrightarrow \sum_{y' \in L^{-1}(y)} \Pr(y|y')\,\Pr(y'|x) \tag{3.3}$$

• $\Pr(y|y')$ is the amount of probability that leaks from location $y'$ to location $y$.

• $L(y')$ is the “leakage neighbourhood” of $y'$.

• $L^{-1}(y)$ is the “inverse leakage neighbourhood” of $y$ defined as $L^{-1}(y) = \{y' : y \in L(y')\}$.

• Leakage is to allow the network output to be “damaged” in a controlled way.

• When the network is optimised it automatically becomes robust with respect to such damage.

• Leakage leads to topographic ordering according to the defined neighbourhood.

• This restricts $\Pr(y|x)$, but allows topographic ordering to be obtained, and is faster to train.
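The leakage step in equation 3.3 amounts to multiplying the posterior vector by a leakage matrix. The following is a minimal sketch under stated assumptions: the matrix `leak`, with `leak[y][y2]` standing for Pr(y|y'), and the 3-neuron example values are hypothetical; entries outside the leakage neighbourhood are simply set to zero.

```python
def apply_leakage(posterior, leak):
    """Pr(y|x) -> sum over y' of Pr(y|y') Pr(y'|x), with leak[y][y2] = Pr(y|y2)."""
    m = len(posterior)
    return [sum(leak[y][y2] * posterior[y2] for y2 in range(m)) for y in range(m)]


# Hypothetical 3-neuron line: each neuron leaks 10% to each nearest neighbour.
# Every column sums to 1, so total probability is conserved.
leak = [[0.9, 0.1, 0.0],
        [0.1, 0.8, 0.1],
        [0.0, 0.1, 0.9]]
posterior = [1.0, 0.0, 0.0]
leaked = apply_leakage(posterior, leak)
```

In this toy case all of the probability starts on the first neuron, and after leakage a tenth of it has moved to its neighbour.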


4. Shorthand notation 


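$$\begin{aligned}
L_{y,y'} &\equiv \Pr(y|y') & P_{y,y'} &\equiv \Pr(y|x; y') \\
p_y &\equiv \sum_{y' \in N^{-1}(y)} P_{y,y'} & (L^T p)_y &\equiv \sum_{y' \in L^{-1}(y)} L_{y,y'}\,p_{y'} \\
d_y &\equiv x - x'(y) & (L d)_y &\equiv \sum_{y' \in L(y)} L_{y',y}\,d_{y'} \\
(P L d)_y &\equiv \sum_{y' \in N(y)} P_{y',y}\,(L d)_{y'} & (P^T P L d)_y &\equiv \sum_{y' \in N^{-1}(y)} P_{y,y'}\,(P L d)_{y'} \\
e_y &\equiv \|x - x'(y)\|^2 & (L e)_y &\equiv \sum_{y' \in L(y)} L_{y',y}\,e_{y'} \\
(P L e)_y &\equiv \sum_{y' \in N(y)} P_{y',y}\,(L e)_{y'} & (P^T P L e)_y &\equiv \sum_{y' \in N^{-1}(y)} P_{y,y'}\,(P L e)_{y'} \\
\bar{d} &\equiv \sum_{y=1}^{M} (L^T p)_y\,d_y & \text{or}\quad \bar{d} &\equiv \sum_{y=1}^{M} (P L d)_y
\end{aligned} \tag{3.4}$$

• This shorthand notation simplifies the appearance of the gradients of $D_1$ and $D_2$.

• For instance, $\Pr(y|x) = \frac{1}{M}(L^T p)_y$.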


5. Derivatives w.r.t. x’ (y) 


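$$\frac{\partial D_1}{\partial x'(y)} = -\frac{4}{nM}\int dx\,\Pr(x)\,(L^T p)_y\,d_y$$

$$\frac{\partial D_2}{\partial x'(y)} = -\frac{4(n-1)}{nM^2}\int dx\,\Pr(x)\,(L^T p)_y\,\bar{d} \tag{3.5}$$

• The extra factor $\frac{1}{M}$ in $\frac{\partial D_2}{\partial x'(y)}$ arises because there is a $\sum_{y=1}^{M}(\cdots)$ hidden inside the $\bar{d}$.


6. Functional derivatives w.r.t. log Q (x|y)


$$\delta D_1 = \frac{2}{nM}\int dx\,\Pr(x)\sum_{y=1}^{M}\delta\log Q(x|y)\,\big(p_y\,(L e)_y - (P^T P L e)_y\big)$$

$$\delta D_2 = \frac{4(n-1)}{nM^2}\int dx\,\Pr(x)\sum_{y=1}^{M}\delta\log Q(x|y)\,\big(p_y\,(L d)_y - (P^T P L d)_y\big)\cdot\bar{d} \tag{3.6}$$

• Differentiate w.r.t. $\log Q(x|y)$ because $Q(x|y) > 0$.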


7. Neural response model 


$$Q(x|y) = \frac{1}{1 + \exp\big(-w(y)\cdot x - b(y)\big)} \tag{3.7}$$

• This is a standard “sigmoid” function.

• This restricts $\Pr(y|x)$, but it is easy to implement, and leads to results similar to the ideal analytic results.
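The neural response model combines with the infinite range normalisation Pr(y|x) = Q(x|y) / Σ Q(x|y') in a few lines. This is a sketch only: it assumes the usual sigmoid form with weight vector w(y) and bias b(y), and the function names and example parameters are hypothetical.

```python
import math


def q(x, w, b):
    """Sigmoid response Q(x|y) = 1 / (1 + exp(-w.x - b)) for one neuron."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x)) - b))


def posterior(x, weights, biases):
    """Pr(y|x) = Q(x|y) / sum over y' of Q(x|y') (infinite range neighbourhood)."""
    qs = [q(x, w, b) for w, b in zip(weights, biases)]
    total = sum(qs)
    return [v / total for v in qs]


# Hypothetical pair of neurons tuned to opposite directions along the first axis.
weights = [(1.0, 0.0), (-1.0, 0.0)]
biases = [0.0, 0.0]
p_origin = posterior((0.0, 0.0), weights, biases)
p_right = posterior((3.0, 0.0), weights, biases)
```

At the origin both neurons respond equally, so the posterior is uniform; moving the input along the first axis shifts probability towards the first neuron.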


8. Derivatives w.r.t. w(y) and b(y) 


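$$\frac{\partial D_1}{\partial\begin{pmatrix} b(y) \\ w(y) \end{pmatrix}} = \frac{2}{nM}\int dx\,\Pr(x)\,\big(p_y\,(L e)_y - (P^T P L e)_y\big)\,\big(1 - Q(x|y)\big)\begin{pmatrix} 1 \\ x \end{pmatrix} \tag{3.8}$$

$$\frac{\partial D_2}{\partial\begin{pmatrix} b(y) \\ w(y) \end{pmatrix}} = \frac{4(n-1)}{nM^2}\int dx\,\Pr(x)\,\big(p_y\,(L d)_y - (P^T P L d)_y\big)\cdot\bar{d}\,\big(1 - Q(x|y)\big)\begin{pmatrix} 1 \\ x \end{pmatrix} \tag{3.9}$$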
C. Circle


1. Training history


• M = 4 and n = 10 were used.

• The reference vectors x’ (y) (for y = 1, 2, 3, 4) are initialised close to the origin.

• The training history leads to stationary x’ (y) just outside the unit circle.
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2. Posterior probabilities 


• Each of the posterior probabilities Pr (y|x) (for y = 1, 2, 3, 4) is large mainly in a π/2 radian arc of the circle.

• There is some overlap between the Pr (y|x).


D. 2-Torus 


1. Posterior probabilities: joint encoding 


• M = 8 and n = 5 were used, which lies inside the joint encoding region of the stability diagram.

• Each of the posterior probabilities Pr (y|x) is large mainly in a localised region of the torus.

• There is some overlap between the Pr (y|x).


2. Posterior probabilities: factorial encoding 


• M = 8 and n = 20 were used, which lies inside the factorial encoding region of the stability diagram.

• Each of the posterior probabilities Pr (y|x) is large mainly in a collar-shaped region of the torus; half circle one way round the torus, and half the other way.

• There is some overlap between the Pr (y|x) that circle the same way round the torus.

• There is a localised region of overlap between a pair of Pr (y|x) that circle the opposite way round the torus.

• These localised overlap regions are the mechanism by which factorial encoding has a small reconstruction distortion.


E. Multiple Independent Targets 


1. Training data 


• The targets were unit height Gaussian bumps with σ = 2.

• The additive noise was uniformly distributed variables in [0, 0.5].


2. Factorial encoding 


• M = 10 and n = 10 were used.

• Each of the reference vectors x’ (y) becomes large in a localised region.

• Each input vector causes a subset of the neurons to fire corresponding to locations of the targets.

• This is a factorial encoder because each neuron responds to only a subspace of the input.


F. Pair of Correlated Targets 


1. Training data 


[Figure: training data; horizontal axis: vector element.]

• The targets were unit height Gaussian bumps with σ = 1.5.
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2. Training history: joint encoding 


• M = 16 and n = 3 were used.

• Each of the reference vectors x’ (y) becomes large in a pair of localised regions.

• Each neuron responds to a small range of positions and separations of the pair of targets.

• The neurons respond jointly to the position and separation of the targets.


3. Training history: factorial encoding


• M = {16, 16} and n = {20, 20} were used; this is a 2-stage encoder.

• The second encoder uses as input the posterior probability output by the first encoder.

• The objective function is the sum of the separate encoder objective functions (with equal weighting given to each).

• The presence of the second encoder affects the optimisation of the first encoder via “self-supervision”.

• Each of the reference vectors x’ (y) becomes large in a single localised region.


4. Training history: invariant encoding


• M = {16, 16} and n = {3, 3} were used; this is a 2-stage encoder.

• During training the ratio of the weighting assigned to the first and second encoders is increased from 1:5 to 1:40.

• Each of the reference vectors x’ (y) becomes large in a single broad region.

• Each neuron responds only to the position (and not the separation) of the pair of targets.

• The response of the neurons is invariant w.r.t. the separation of the targets.


G. Separating Different Waveforms


1. Training data


[Figure: training data.]

• This data is the superposition of a pair of waveforms plus noise.

• In each training vector the relative phase of the two waveforms is randomly selected.


2. Training history: factorial encoding


• M = 10 and n = 20 were used.

• Each of the reference vectors x’ (y) becomes one or other of the two waveforms, and has a definite phase.

• Each neuron responds to only one of the waveforms, and then only when its phase is in a localised range.


H. Maternal + Foetal ECG


1. Training data


• This data is an 8-channel ECG recording taken from a pregnant woman.

• The large spikes are the woman’s heartbeat.

• The noise masks the foetus’ heartbeat.

• This data was whitened before training the neural network.


2. Factorial Encoding


• M = 16 and n = 20 were used.

• The results shown are w(y)·x computed for all neurons (y = 1, 2, ···, 8) for each 8-dimensional input vector x.

• After limited training some, but not all, of the neurons have converged.

• The broadly separated spikes indicate a neuron that responds to the mother’s heartbeat.

• The closely separated spikes indicate a neuron that responds to the foetus’ heartbeat.


I. Visual Cortex Network (VICON)


1. Training Data


• This is a Brodatz texture image, whose spatial correlation length is 5-10 pixels.


2. Orientation map


• M = 30 × 30 and n = 1 were used.

• Input window size = 17 × 17, neighbourhood size = 9 × 9, leakage neighbourhood size = 3 × 3 were used.

• Leakage probability was sampled from a 2-dimensional Gaussian PDF, with σ = 1 in each direction.

• Each of the reference vectors x’ (y) typically looks like a small patch of image.

• Leakage induces topographic ordering across the array of neurons.

• This makes the array of reference vectors look like an “orientation map”.


3. Sparse coding


• The trained network is used to encode and decode a typical input image.

• Middle image = posterior probability. This shows “sparse coding” with a small number of “activity bubbles”.

• Right image = reconstruction. Apart from edge effects, this is a low resolution version of the input.


4. Dominance stripes


• Interdigitate a pair of training images, so that one occupies the black squares, and the other the white squares, of a “chess board”.

• Preprocess this interdigitated image to locally normalise it using a finite range neighbourhood.

• M = 100 × 100 and n = 1 were used.

• Input window size = 3 × 3, neighbourhood size = 5 × 5, leakage neighbourhood size = 3 × 3 were used.

• Leakage probability was sampled from a 2-dimensional Gaussian PDF, with σ = 1 in each direction.

• The dominance stripe map records for each neuron which of the 2 interdigitated images causes it to respond more strongly.

• The dominance stripes tend to run perpendicularly into the boundaries, because the neighbourhood window is truncated at the edge of the array.
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Stochastic Vector Quantisers * 


S P Luttrell†
Defence Evaluation and Research Agency, St Andrews Rd, Malvern, Worcs, 
WR14 8PS, United Kingdom, tel: +44 (0) 1684-894046, fax: +44 (0) 1684-894384 


In this paper a stochastic generalisation of the standard Linde-Buzo-Gray (LBG) approach to
vector quantiser (VQ) design is presented. The encoder is implemented as the sampling of a
vector of code indices from a probability distribution derived from the input vector, and the decoder
is implemented as a superposition of reconstruction vectors. The stochastic VQ is optimised
using a minimum mean Euclidean reconstruction distortion criterion, as in the LBG case. Numerical
simulations are used to demonstrate how this leads to self-organisation of the stochastic VQ, where
different stochastically sampled code indices become associated with different input subspaces. This
property may be used to automate the process of splitting high-dimensional input vectors into
low-dimensional blocks before encoding them.


I. INTRODUCTION 


In vector quantisation a code book is used to encode 
each input vector as a corresponding code index, which 
is then decoded (again, using the codebook) to produce 
an approximate reconstruction of the original input vec- 
tor [2, 3]. The purpose of this paper is to generalise the 
standard approach to vector quantiser (VQ) design [7], 
so that each input vector is encoded as a vector of code 
indices that are stochastically sampled from a probabil- 
ity distribution that depends on the input vector, rather 
than as a single code index that is the deterministic out- 
come of finding which entry in a code book is closest 
to the input vector. This will be called a stochastic VQ 
(SVQ), and it includes the standard VQ as a special case. 
Note that this approach is different from the various soft 
competition and stochastic relaxation schemes that are 
used to train VQs (see e.g. [13]), because here the prob- 
ability distribution is an essential part of the encoder, 
both during and after training. 


One advantage of using the stochastic approach, which 
will be demonstrated in this paper, is that it automates 
the process of splitting high-dimensional input vectors 
into low-dimensional blocks before encoding them, be- 
cause minimising the mean Euclidean reconstruction er- 
ror can encourage different stochastically sampled code 
indices to become associated with different input sub- 
spaces [11]. Another advantage is that it is very easy to 
connect SVQs together, by using the vector of code index 
probabilities computed by one SVQ as the input vector 
to another SVQ [10]. 


In section II various pieces of previously published theory are unified to give a coherent account of SVQs. In section III the results of some new numerical simulations are presented, which demonstrate how the code indices in a SVQ can become associated in various ways with input subspaces.


II. THEORY 


In this section various pieces of previously published 
theory are unified to establish a coherent framework for 
modelling SVQs. 

In section II A the basic theory of folded Markov chains
(FMC) is given [8], and in section II B it is extended to
the case of high-dimensional input data [9]. In section
II C some properties of the solutions that emerge when
the input vector lives on a 2-torus are summarised [11].
Finally, in section II D the theory is further generalised
to chains of linked FMCs [10].


A. Folded Markov Chains 


The basic building block of the encoder/decoder model used in this paper is the folded Markov chain (FMC) [8]. Thus an input vector x is encoded as a code index vector y, which is then subsequently decoded as a reconstruction x’ of the input vector. Both the encoding and decoding operations are allowed to be probabilistic, in the sense that y is a sample drawn from Pr (y|x), and x’ is a sample drawn from Pr (x’|y), where Pr (y|x) and Pr (x|y) are Bayes’ inverses of each other, as given by

$$\Pr(\mathbf{x}|\mathbf{y}) = \frac{\Pr(\mathbf{y}|\mathbf{x})\,\Pr(\mathbf{x})}{\int d\mathbf{z}\,\Pr(\mathbf{y}|\mathbf{z})\,\Pr(\mathbf{z})} \tag{2.1}$$

and Pr (x) is the prior probability from which x was sampled.

Because the chain of dependences in passing from x to 
y and then to x’ is first order Markov (i.e. it is described 
by the directed graph x —> y —> x’), and because 
the two ends of this Markov chain (i.e. x and x’) live 
in the same vector space, it is called a folded Markov 


encode 


decode 


Figure 1: A folded Markov chain (FMC) in which an input 
vector x is encoded as a code index vector y that is drawn 
from a conditional probability Pr (y|x), which is then decoded 
as a reconstruction vector x’ drawn from the Bayes’ inverse 
conditional probability Pr (x’|y). 


chain (FMC). The operations that occur in an FMC are 
summarised in figure 1. 

In order to ensure that the FMC encodes the input vec- 
tor optimally, a measure of the reconstruction error must 
be minimised. There are many possible ways to define 
this measure, but one that is consistent with many pre- 
vious results, and which also leads to many new results, 


is the mean Euclidean reconstruction error measure D, 
which is defined as 


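$$D = \int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{\mathbf{y}=1}^{M}\Pr(\mathbf{y}|\mathbf{x})\int d\mathbf{x}'\,\Pr(\mathbf{x}'|\mathbf{y})\,\|\mathbf{x} - \mathbf{x}'\|^2 \tag{2.2}$$

where $\Pr(\mathbf{x})\,\Pr(\mathbf{y}|\mathbf{x})\,\Pr(\mathbf{x}'|\mathbf{y})$ is the joint probability that the FMC has state $(\mathbf{x}, \mathbf{y}, \mathbf{x}')$, $\|\mathbf{x} - \mathbf{x}'\|^2$ is the Euclidean reconstruction error, and $\int d\mathbf{x}\sum_{\mathbf{y}=1}^{M}\int d\mathbf{x}'\,(\cdots)$ sums over all possible states of the FMC (weighted by the joint probability). The code index vector y is assumed to lie on a rectangular lattice of size M.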

The Bayes’ inverse probability Pr (x’|y) may be inte- 
grated out of this expression for D to yield 


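$$D = 2\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{\mathbf{y}=1}^{M}\Pr(\mathbf{y}|\mathbf{x})\,\|\mathbf{x} - \mathbf{x}'(\mathbf{y})\|^2 \tag{2.3}$$

where the reconstruction vector $\mathbf{x}'(\mathbf{y})$ is defined as $\mathbf{x}'(\mathbf{y}) = \int d\mathbf{x}\,\Pr(\mathbf{x}|\mathbf{y})\,\mathbf{x}$. Because of the quadratic form of the objective function, it turns out that $\mathbf{x}'(\mathbf{y})$ may be treated as a free parameter whose optimum value (i.e. the solution of $\partial D/\partial\mathbf{x}'(\mathbf{y}) = 0$) is $\int d\mathbf{x}\,\Pr(\mathbf{x}|\mathbf{y})\,\mathbf{x}$, as required.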

If D is now minimised with respect to the probabilis- 
tic encoder Pr (y|x) and the reconstruction vector x’ (y), 
then the optimum has the form 


$$\begin{aligned}
\Pr(\mathbf{y}|\mathbf{x}) &\longrightarrow \delta_{\mathbf{y},\mathbf{y}(\mathbf{x})} \\
\mathbf{y}(\mathbf{x}) &= \underset{\mathbf{y}}{\arg\min}\,\|\mathbf{x} - \mathbf{x}'(\mathbf{y})\|^2 \\
\mathbf{x}'(\mathbf{y}) &= \int d\mathbf{x}\,\Pr(\mathbf{x}|\mathbf{y})\,\mathbf{x}
\end{aligned} \tag{2.4}$$

where Pr (y|x) has reduced to a deterministic encoder (as described by the Kronecker delta $\delta_{\mathbf{y},\mathbf{y}(\mathbf{x})}$), y (x) is a nearest neighbour encoding algorithm using code vectors x’ (y) to partition the input space into code cells, and (in an optimised configuration) the x’ (y) are the centroids of these code cells. This is equivalent to a standard VQ [7].
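The alternation implied by equation 2.4 (nearest neighbour encoding, then centroid update) can be sketched as a plain LBG-style iteration. This is an illustrative sketch rather than the procedure used in the paper: the function names, the toy two-cluster data, and the choice to leave an empty cell's code vector unchanged are all assumptions.

```python
import random


def nearest(x, codebook):
    """Index y(x) of the code vector closest to x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, codebook[y])))


def lbg_step(data, codebook):
    """One iteration: nearest-neighbour partition, then centroid update."""
    cells = [[] for _ in codebook]
    for x in data:
        cells[nearest(x, codebook)].append(x)
    new_codebook = []
    for y, cell in enumerate(cells):
        if cell:
            dim = len(cell[0])
            new_codebook.append(
                tuple(sum(x[k] for x in cell) / len(cell) for k in range(dim)))
        else:
            new_codebook.append(codebook[y])  # assumption: keep an empty cell's vector
    return new_codebook


def distortion(data, codebook):
    """Sample estimate of D (equation 2.3) for the deterministic encoder."""
    total = 0.0
    for x in data:
        c = codebook[nearest(x, codebook)]
        total += sum((a - b) ** 2 for a, b in zip(x, c))
    return 2.0 * total / len(data)


rng = random.Random(0)
data = [(rng.gauss(cx, 0.1), rng.gauss(cy, 0.1))
        for cx, cy in [(0, 0), (1, 1)] for _ in range(50)]
codebook = [data[0], data[-1]]
before = distortion(data, codebook)
for _ in range(10):
    codebook = lbg_step(data, codebook)
after = distortion(data, codebook)
```

Each step can only lower (or leave unchanged) the sample distortion, which is the property the test below checks.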

An extension of the standard VQ to the case where the code index is transmitted along a noisy communication channel before the reconstruction is attempted [1, 6] was derived in [8], and was shown to lead to a good approximation to a topographic mapping neural network [5].


B. High Dimensional Input Spaces 


A problem with the standard VQ is that its code book 
grows exponentially in size as the dimensionality of the 
input vector is increased, assuming that the contribution 
to the reconstruction error from each input dimension is 
held constant. This means that such VQs are useless for 
encoding extremely high dimensional input vectors, such 
as images. The usual solution to this problem is to parti- 
tion the input space into a number of lower dimensional 
subspaces (or blocks), and then to encode each of these 
subspaces separately. However, this produces an unde- 
sirable side effect, where the boundaries of the blocks are 
clearly visible in the reconstruction; this is the origin of 
the blocky appearance of reconstructed images, for in- 
stance. 

There is also a much more fundamental objection to 
this type of partitioning, because the choice of blocks 
should ideally be decided in such a way that the corre- 
lations within a block are much stronger than the cor- 
relations between blocks. In the case of image encoding, 
this will usually be the case if the partitioning is that 
each block consists of contiguous image pixels. However, 
more generally, there is no guarantee that the input vec- 
tor statistics will respect the partitioning in this conve- 
nient way. Thus it will be necessary to deduce the best 
choice of blocks from the training data. 

In order to solve the problem of finding the best parti- 
tioning, consider the following constraint on Pr (y|x) and 
x’ (y) in the FMC objective function in equation 2.3 


$$\begin{aligned}
\mathbf{y} &= (y_1, y_2, \cdots, y_n), \qquad 1 \le y_i \le M \\
\Pr(\mathbf{y}|\mathbf{x}) &= \Pr(y_1|\mathbf{x})\,\Pr(y_2|\mathbf{x})\cdots\Pr(y_n|\mathbf{x}) \\
\mathbf{x}'(\mathbf{y}) &= \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}'(y_i)
\end{aligned} \tag{2.5}$$

Thus the code index vector y is assumed to be n-dimensional, each component $y_i$ (for $i = 1, 2, \cdots, n$ and $1 \le y_i \le M$) is an independent sample drawn from Pr (y|x), and the reconstruction vector x’ (y) (vector argument) is assumed to be a superposition of n contributions $\mathbf{x}'(y_i)$ (scalar argument) for $i = 1, 2, \cdots, n$. As D is minimised, this constraint allows partitioned solutions to emerge by a process of self-organisation.
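A minimal sketch of the constrained encoder and decoder may help fix ideas: n code indices are drawn independently from Pr (y|x), and the reconstruction is the average of the corresponding reference vectors. The posterior and reference vectors below are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
import random


def sample_code_indices(posterior, n, rng):
    """Draw n independent code indices y_1..y_n from Pr(y|x)."""
    return rng.choices(range(len(posterior)), weights=posterior, k=n)


def reconstruct(indices, ref_vectors):
    """x'(y) = (1/n) * sum over i of x'(y_i): superposition of n contributions."""
    n = len(indices)
    dim = len(ref_vectors[0])
    return tuple(sum(ref_vectors[y][k] for y in indices) / n for k in range(dim))


rng = random.Random(1)
posterior = [0.5, 0.5, 0.0]                          # hypothetical Pr(y|x), M = 3
ref_vectors = [(2.0, 0.0), (0.0, 2.0), (5.0, 5.0)]   # hypothetical x'(y)
y = sample_code_indices(posterior, n=1000, rng=rng)
x_rec = reconstruct(y, ref_vectors)
```

With many samples the reconstruction approaches the posterior-weighted mean of the reference vectors, here close to (1, 1), which is the quantity that appears inside the large-n objective.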
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output 


input 


Figure 2: A typical solution in which the components of the 
input vector x and the range of values of the output code 
index y are both partitioned into blocks. These input and 
output blocks are connected together as illustrated. More 
generally, the blocks may overlap each other. 


For instance, solutions can have the structure illus- 
trated in figure 2. This type of structure is summarised 
as follows 


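$$\begin{aligned}
\mathbf{x} &= (\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N) \\
\Pr(\mathbf{y}|\mathbf{x}) &= \Pr(y_1|\mathbf{x}_{p(y_1)})\,\Pr(y_2|\mathbf{x}_{p(y_2)})\cdots\Pr(y_n|\mathbf{x}_{p(y_n)})
\end{aligned} \tag{2.6}$$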


In this type of solution the input vector x is partitioned as $(\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N)$, and the probability Pr (y|x) reduces to $\Pr(y|\mathbf{x}_{p(y)})$, which depends only on $\mathbf{x}_{p(y)}$, where the function p(y) computes the index of the block which code index y inhabits. There is not an exact correspondence between this type of partitioning and that used in the standard approach to encoding image blocks, because here the n code indices are spread at random over the N image blocks, which does not guarantee that every block is encoded. However, for a given N, if n is chosen to be sufficiently large, then there is a virtual certainty that every block is encoded. This is the price that has to be paid when it is assumed that the code indices are drawn independently.
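The "virtual certainty" can be quantified under a simplifying assumption that is not made in the text: each of the n code indices falls in each of the N blocks with equal probability 1/N. The probability that every block receives at least one code index then follows from inclusion-exclusion, as in this sketch (the function name is illustrative).

```python
from math import comb


def prob_all_blocks_encoded(num_blocks, n):
    """Probability that n independent uniform draws hit every one of the blocks."""
    big_n = num_blocks
    return sum((-1) ** k * comb(big_n, k) * (1 - k / big_n) ** n
               for k in range(big_n + 1))


# Example: N = 4 blocks, n = 4, 10, 20, 40 sampled code indices.
for n in (4, 10, 20, 40):
    print(n, prob_all_blocks_encoded(4, n))
```

Under this assumption the failure probability falls off roughly as N(1 − 1/N)^n, so a modest multiple of N samples already makes an unencoded block very unlikely.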

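The constraints in equation 2.5 prevent the full space of possible values of Pr (y|x) or x’ (y) from being explored as D is minimised, so they lead to an upper bound $D_1 + D_2$ on the FMC objective function D (i.e. $D \le D_1 + D_2$), which may be derived as [9]

$$\begin{aligned}
D_1 &= \frac{2}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{M}\Pr(y|\mathbf{x})\,\|\mathbf{x} - \mathbf{x}'(y)\|^2 \\
D_2 &= \frac{2(n-1)}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\,\Big\|\mathbf{x} - \sum_{y=1}^{M}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)\Big\|^2
\end{aligned} \tag{2.7}$$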


Note that M and n are model order parameters, whose 
values need to be chosen appropriately for each encoder 
optimisation problem. 
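For a finite training set the two terms of the bound can be estimated by replacing the integral over Pr (x) with a sample average. The sketch below assumes the prefactors 2/n for D_1 and 2(n−1)/n for D_2; the function names, the callable `posterior`, and the one-point example are hypothetical.

```python
def sq_norm(v):
    """Squared Euclidean norm of a vector given as a sequence of floats."""
    return sum(c * c for c in v)


def d1_d2(data, posterior, ref_vectors, n):
    """Sample estimates of D_1 and D_2 (assumed prefactors 2/n and 2(n-1)/n).

    data        : list of input vectors x
    posterior   : function x -> [Pr(y|x) for each of the M code indices]
    ref_vectors : list of M reconstruction vectors x'(y)
    n           : number of code indices sampled per input
    """
    d1 = d2 = 0.0
    for x in data:
        p = posterior(x)
        dim = len(x)
        # D_1 integrand: sum over y of Pr(y|x) ||x - x'(y)||^2
        d1 += sum(p_y * sq_norm([x[k] - r[k] for k in range(dim)])
                  for p_y, r in zip(p, ref_vectors))
        # D_2 integrand: ||x - sum over y of Pr(y|x) x'(y)||^2
        mean_rec = [sum(p_y * r[k] for p_y, r in zip(p, ref_vectors))
                    for k in range(dim)]
        d2 += sq_norm([x[k] - mean_rec[k] for k in range(dim)])
    size = len(data)
    return 2.0 / n * d1 / size, 2.0 * (n - 1) / n * d2 / size


data = [(1.0, 0.0)]
ref_vectors = [(2.0, 0.0), (1.0, 0.0)]
uniform = lambda x: [0.5, 0.5]          # hypothetical Pr(y|x)
single = d1_d2(data, uniform, ref_vectors, n=1)
double = d1_d2(data, uniform, ref_vectors, n=2)
```

With n = 1 the D_2 estimate vanishes identically, and as n grows the weight shifts from the D_1 term to the D_2 term.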

For n = 1 only the $D_1$ term contributes, and it is equivalent to the FMC objective function D in equation 2.3 with the vector code index y replaced by a scalar code index y, so its minimisation leads to a standard vector quantiser as in equation 2.4, in which each input vector is approximated by a single reconstruction vector x’ (y).

When n becomes large enough that $D_2$ dominates over $D_1$, the optimisation problem reduces to minimisation of the mean Euclidean reconstruction error (approximately). This encourages the approximation $\mathbf{x} \approx \sum_{y=1}^{M}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)$ to hold, in which the input vector x is approximated as a weighted (using weights Pr (y|x)) sum of many reconstruction vectors x’ (y). In numerical simulations this has invariably led to solutions which are a type of principal components analysis (PCA) of the input vectors, where the expansion coefficients Pr (y|x) are constrained to be non-negative and sum to unity. Also, the approximation $\mathbf{x} \approx \sum_{y=1}^{M}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)$ is very good for solutions in which Pr (y|x) depends on the whole of x, rather than merely on a subspace of x, so this does not lead to a partitioned solution.

For intermediate values of n, where both $D_1$ and $D_2$ are comparable in size, partitioned solutions can emerge. However, in this intermediate region the properties of the optimum solution depend critically on the interplay between the statistical properties of the training data and the model order parameters M and n. To illustrate how the choice of M and n affects the solution, the case of input vectors that live on a 2-torus is summarised in section II C.

When the full FMC objective function in equation 2.3 is optimised it leads to the standard (deterministic) VQ in equation 2.4. However, it turns out that the constrained FMC objective function $D_1 + D_2$ in equation 2.7 does not allow a deterministic VQ to emerge (except in the case n = 1), because a more accurate reconstruction can be obtained by allowing more than one code index to be sampled for each input vector. Because of this behaviour, in which the encoder is stochastic both during and after training, this type of constrained FMC will be called a SVQ.


C. Example: 2-Torus Case 


A scene is defined as a number of objects at specified positions and orientations, so it is specified by a
low-dimensional vector of scene coordinates, which are
effectively the intrinsic coordinates of a low-dimensional 
manifold. An image of that scene is an embedding of the 
low-dimensional manifold in the high-dimensional space 
of image pixel values. Because the image pixel values are 
non-linearly related to the vector of scene coordinates, 
this embedding operation distorts the manifold so that 
it becomes curved. The problem of finding the optimal 
way to encode images may thus be viewed as the problem
of finding the optimal way to encode curved manifolds,
where the intrinsic dimensionality of the manifold is the
same as the dimensionality of the vector of scene coordinates.


Figure 3: A typical probability Pr(y|x) (only one y value 
is illustrated) for encoding a 2-torus using a small value of 
n. This defines a smoothly tapered localised region on the 
2-torus. The toroidal mesh serves only to help visualise the 
2-torus. 


The simplest curved manifold is the circle, and the next 
most simple curved manifold is the 2-torus (which has 2 
intrinsic circular coordinates). By making extensive use 
of the fact that the optimal form of Pr (y|x) must be a 
piecewise linear function of the input vector x [10], the 
optimal encoders for these manifolds have been derived 
analytically [11]. The toroidal case is very interesting 
because it demonstrates the transition between the un- 
partitioned and partitioned optimum solutions as n is 
increased. A 2-torus is a realistic model of the manifold
generated by the linear superposition of 2 sine waves,
having fixed amplitudes and wavenumbers. The phases
of the 2 sine waves are then the 2 intrinsic circular coordinates
of this manifold, and if these phases are both
uniformly distributed on the circle, then Pr (x) defines
a constant probability density on the 2-torus.

A typical Pr (y|x) for small n is illustrated in figure 3. 
Only in the case where n = 1 would Pr (y|x) correspond
to a sharply defined code cell; for n > 1 the edges of the
code cells are tapered so that they overlap with one another.
The 2-torus is covered with a large number of these over- 
lapping code cells, and when a code index y is sampled 
from Pr (y|x), it allows a reconstruction of the input to be 
made to within an uncertainty area commensurate with 
the size of a code cell. This type of encoding is called 
joint encoding, because the 2 intrinsic dimensions of the 
2-torus are simultaneously encoded by y. 

A typical pair of Pr (y|x) (i.e. $\Pr(y_1|\mathbf{x})$ and $\Pr(y_2|\mathbf{x})$) for large n is illustrated in figure 4. The partitioning splits the code indices into two different types: one type encodes one of the intrinsic dimensions of the 2-torus, and the other type encodes the other intrinsic dimension of the 2-torus, so each code cell is a tapered collar-shaped region. When the code indices $(y_1, y_2, \cdots, y_n)$ are sampled from Pr (y|x), they allow a reconstruction of the input to be made to within an uncertainty area commensurate with the size of the region of intersection of a pair of orthogonal code cells, as illustrated in figure 4. This type of encoding is called factorial encoding, because the 2 intrinsic dimensions of the 2-torus are separately encoded in $(y_1, y_2, \cdots, y_n)$.

Figure 4: Two typical probabilities $\Pr(y_1|\mathbf{x})$ and $\Pr(y_2|\mathbf{x})$ for encoding a 2-torus using a large value of n. Each separately defines a smoothly tapered collar-shaped region on the 2-torus. However, taken together, their region of intersection defines a smoothly tapered localised region on the 2-torus. The toroidal mesh serves only to help visualise the 2-torus.

For a 2-torus there is an upper limit M = 12 beyond 
which the optimum solution is always a joint encoder 
(as shown in figure 3). This limit arises because when 
M = 12 the code book is sufficiently large that the joint 
encoder gives the best reconstruction for all values of n. 
This result critically depends on the fact that as n is 
increased the code cells overlap progressively more and 
more, so the accuracy of the reconstruction progressively 
increases. For M = 12 the rate at which the accuracy of 
the joint encoder improves (as n increases) is sufficiently 
great that it is always better than that of the factorial 
encoder (which also improves as n increases). 


D. Chains of Linked FMCs 


Thus far it has been shown that the FMC objective 
function in equation 2.3, with the constraints imposed in 
equation 2.5, leads to useful properties, such as the auto- 
matic partitioning of the code book to yield the factorial 
encoder, such as that illustrated in figure 4 (and more 
generally, as illustrated in figure 2). The free parameters 
(M,n) (i.e. the size of the code book, and the number of 
code indices sampled) can be adjusted to obtain an opti- 
mal solution that has the desired properties (e.g. a joint 
or a factorial encoder, as in figures 3 and 4, respectively). 
However, since there are only 2 free parameters, there is 
a limit to the variety of types of properties that the op- 
timal solution can have. It would thus be very useful to 
introduce more free parameters. 

The FMC illustrated in figure 1 may be generalised to a chain of linked FMCs as shown in figure 5. Each stage in this chain is an FMC of the type shown in figure 1, and the vector of probabilities (for all values of the code index) computed by each stage is used as the input vector to the next stage; there are other ways of linking the stages together, but this is the simplest possibility. The overall objective function is a weighted sum of the FMC objective functions derived from each stage. The total number of free parameters in an L stage chain is 3L − 1, which is the sum of 2 free parameters for each of the L stages, plus L − 1 weighting coefficients; there are L − 1 rather than L weighting coefficients because the overall normalisation of the objective function does not affect the optimum solution.

Figure 5: A chain of linked FMCs, in which the output from each stage is its vector of posterior probabilities (for all values of the code index), which is then used as the input to the next stage. Only 3 stages are shown, but any number may be used. More generally, any acyclically linked network of FMCs may be used.


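The chain of linked FMCs may be expressed mathematically
by first of all introducing an index l to allow
different stages of the chain to be distinguished thus


\[
\begin{aligned}
M &\longrightarrow M^{(l)} &\qquad y &\longrightarrow y^{(l)} \\
x &\longrightarrow x^{(l)} &\qquad x' &\longrightarrow x^{(l)\prime} \\
n &\longrightarrow n^{(l)} &\qquad D &\longrightarrow D^{(l)} \\
D_1 &\longrightarrow D_1^{(l)} &\qquad D_2 &\longrightarrow D_2^{(l)}
\end{aligned}
\tag{2.8}
\]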


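The stages are then defined and linked together thus (the
details are given only as far as the input to the third
stage)


\[
\begin{aligned}
& x^{(1)} \longrightarrow y^{(1)} \longrightarrow x^{(1)\prime} \\
& x^{(2)} = \left( x_1^{(2)}, x_2^{(2)}, \ldots, x_{M^{(1)}}^{(2)} \right) \\
& x_i^{(2)} = \Pr\left( y^{(1)} = i \,|\, x^{(1)} \right), \qquad 1 \le i \le M^{(1)} \\
& x^{(2)} \longrightarrow y^{(2)} \longrightarrow x^{(2)\prime} \\
& x_i^{(3)} = \Pr\left( y^{(2)} = i \,|\, x^{(2)} \right), \qquad 1 \le i \le M^{(2)}
\end{aligned}
\tag{2.9}
\]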
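
The linking rule of equation 2.9 can be sketched in a few lines of Python. This is only an illustration, not the code used for the simulations: it assumes the normalised sigmoid model of Pr(y|x) described in appendix A, and the function names (`posterior`, `chain_inputs`) and all numerical values are hypothetical.

```python
import math

def posterior(x, weights, biases):
    """Return the vector of Pr(y = i | x) for one stage (normalised sigmoids)."""
    q = []
    for w, b in zip(weights, biases):
        activation = sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
        q.append(1.0 / (1.0 + math.exp(-activation)))
    total = sum(q)
    return [q_i / total for q_i in q]

def chain_inputs(x1, stages):
    """Return [x^(1), x^(2), ...]: each stage's posterior is the next stage's input."""
    inputs = [x1]
    for weights, biases in stages:
        inputs.append(posterior(inputs[-1], weights, biases))
    return inputs

# Hypothetical 2-stage chain: M = 3 codes on a 2-dimensional input, then M = 2 codes.
stage1 = ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.5])
stage2 = ([[2.0, -1.0, 0.0], [0.0, 1.0, -2.0]], [0.1, -0.1])
xs = chain_inputs([0.3, -0.7], [stage1, stage2])
```

Here `xs[1]` has one component per code index of the first stage, so the dimensionality of the input to each stage is fixed by the code book size of the stage before it.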


The objective function and its upper bound are then
given by


\[
\begin{aligned}
D &= \sum_{l=1}^{L} s^{(l)} D^{(l)} \\
&\le D_1 + D_2 \\
&= \sum_{l=1}^{L} s^{(l)} \left( D_1^{(l)} + D_2^{(l)} \right)
\end{aligned}
\tag{2.10}
\]


where s^(l) > 0 is the weighting that is applied to the
contribution of stage l of the chain to the overall objective
function.
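
As a small illustration of the weighted sum and of the parameter count quoted above, the following Python sketch may help. It assumes that the per-stage values of D_1 and D_2 have already been computed; the function names and the numbers are purely illustrative.

```python
def chain_objective(stage_terms, weights):
    """Weighted sum over stages of (D1 + D2), i.e. the upper bound on D."""
    if len(stage_terms) != len(weights):
        raise ValueError("one weight per stage is required")
    return sum(s * (d1 + d2) for s, (d1, d2) in zip(weights, stage_terms))

def free_parameter_count(num_stages):
    """2 parameters (M, n) per stage plus L - 1 relative weightings."""
    return 2 * num_stages + (num_stages - 1)

# Hypothetical (D1, D2) values for a 2-stage chain with weightings 1 and 5.
total = chain_objective([(0.40, 0.10), (0.02, 0.01)], [1.0, 5.0])
```

Multiplying every weighting by the same positive constant rescales `total` without moving its minimum, which is why only L - 1 of the weightings count as free parameters.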

The piecewise linearity property enjoyed by Pr(y|x) in
a single stage chain also holds for all of the Pr(y^(l)|x^(l)) in
a multi-stage chain, provided that the stages are linked
together as prescribed in equation 2.9 [10]. This will
allow optimum analytic solutions to be derived by an
extension of the single stage methods used in [11].


III. SIMULATIONS 


In this section the results of various numerical simula- 
tions are presented, which demonstrate some of the types 
of behaviour exhibited by an encoder that consists of a 
chain of linked FMCs. Synthetic, rather than real, train- 
ing data are used in all of the simulations, because this 
allows the basic types of behaviour to be cleanly demon- 
strated. 

In section III A the training algorithm is presented.
In section III B the training data is described. In section
III C a single stage encoder is trained on data that
is a superposition of two randomly positioned objects.
In section III D this is generalised to objects with correlated
positions, and three different types of behaviour are
demonstrated: factorial encoding using both a 1-stage
and a 2-stage encoder (section III D 1), joint encoding
using a 1-stage encoder (section III D 2), and invariant
encoding (i.e. ignoring a subspace of the input space
altogether) using a 2-stage encoder (section III D 3).


A. Training Algorithm 


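Assuming that Pr(y|x) is modelled as in appendix A
(i.e. $\Pr(y|x) = Q(y|x) / \sum_{y'=1}^{M} Q(y'|x)$ and
$Q(y|x) = 1 / (1 + \exp(-w(y) \cdot x - b(y)))$), then the partial
derivatives of D_1 + D_2 with respect to the 3 types of
parameters in a single stage of the encoder may be denoted as


\[
\begin{aligned}
g_w(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial w(y)} \\
g_b(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial b(y)} \\
g_x(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial x'(y)}
\end{aligned}
\tag{3.1}
\]


This may be generalised to each stage of a multi-stage encoder
by including an (l) superscript, and ensuring that
for each stage the partial derivatives include the additional
contributions that arise from forward propagation
through later stages; this is essentially an application
of the chain rule of differentiation, using the derivatives
$\partial x^{(l+1)} / \partial w(y^{(l)})$ and $\partial x^{(l+1)} / \partial b(y^{(l)})$
to link the stages together (see appendix A).

A simple algorithm for updating these parameters is
(omitting the (l) superscript, for clarity)


\[
\begin{aligned}
w(y) &\longrightarrow w(y) - \epsilon \, \frac{g_w(y)}{g_{w,0}} \\
b(y) &\longrightarrow b(y) - \epsilon \, \frac{g_b(y)}{g_{b,0}} \\
x'(y) &\longrightarrow x'(y) - \epsilon \, \frac{g_x(y)}{g_{x,0}}
\end{aligned}
\tag{3.2}
\]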


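where ε is a small update step size parameter, and the
three normalisation factors are defined as


\[
\begin{aligned}
g_{w,0} &\equiv \max_y \sqrt{\frac{\| g_w(y) \|^2}{\dim x}} \\
g_{b,0} &\equiv \max_y \left| g_b(y) \right| \\
g_{x,0} &\equiv \max_y \sqrt{\frac{\| g_x(y) \|^2}{\dim x}}
\end{aligned}
\tag{3.3}
\]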


The $g_w(y)/g_{w,0}$ and $g_x(y)/g_{x,0}$ factors ensure that the maximum
update step size for w(y) and x'(y) is $\epsilon \sqrt{\dim x}$ (i.e. ε per
dimension), and the $g_b(y)/g_{b,0}$ factor ensures that the maximum
update step size for b(y) is ε. This update algorithm
can be generalised to use a different ε for each
stage of the encoder, and also to allow a different ε to be
used for each of the 3 types of parameter. Furthermore,
the size of ε can be varied as training proceeds, usually
starting with a large value, and then gradually reducing
its size as the solution converges. It is not possible to give
general rules for exactly how to do this, because training
conditions depend very much on the statistical properties
of the training set.

Figure 6: An example of a typical training vector for M = 24.
Each object is a Gaussian hump with a half-width of 1.5 units,
and peak amplitude of 1. The overall input vector is formed
as a linear superposition of the 2 objects. Note that the input
vector is wrapped around circularly to remove minor edge
effects that would otherwise arise.
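
The normalised update rule of section III A can be sketched as follows. This is a hedged illustration rather than the original implementation: it assumes that the normalisation factor for the vector-valued parameters w(y) and x'(y) is the largest RMS gradient component over all code indices, and that for the biases b(y) it is the largest absolute gradient. The function names and the example gradients are hypothetical.

```python
import math

def rms(v):
    """Root-mean-square component of a vector."""
    return math.sqrt(sum(c * c for c in v) / len(v))

def update_vectors(vectors, grads, eps):
    """Step each vector against its gradient; the largest RMS step is eps per dimension."""
    g0 = max(rms(g) for g in grads)
    if g0 == 0.0:
        return [list(v) for v in vectors]
    return [[v_k - eps * g_k / g0 for v_k, g_k in zip(v, g)]
            for v, g in zip(vectors, grads)]

def update_scalars(values, grads, eps):
    """Step each scalar against its gradient; the largest step has size eps."""
    g0 = max(abs(g) for g in grads)
    if g0 == 0.0:
        return list(values)
    return [v - eps * g / g0 for v, g in zip(values, grads)]

# Hypothetical parameters and gradients for a code book with 2 entries.
w = [[0.0, 0.0], [1.0, 1.0]]
g_w = [[3.0, 4.0], [0.3, 0.4]]
b = [0.0, 0.5]
g_b = [-2.0, 1.0]
w_new = update_vectors(w, g_w, eps=0.1)
b_new = update_scalars(b, g_b, eps=0.1)
```

Because only the ratio of each gradient to the largest one enters the update, the step size is controlled by ε alone and does not depend on the overall scale of the objective function.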


B. Training Data 


The key property that this type of self-organising en- 
coder exhibits is its ability to automatically split up high- 
dimensional input spaces into lower-dimensional sub- 
spaces, each of which is separately encoded. For instance, 
see section IIB for a summary of the analytically solved 
case of training data that lives on a simple curved manifold
(i.e. a 2-torus). This self-organisation manifests
itself in many different ways, depending on the interplay 
between the statistical properties of the training data, 
and the 3 free parameters (i.e. the code book size M, 
the number of code indices sampled n, and the stage 
weighting s) per stage of the encoder (see section IID). 
However, it turns out that the joint and factorial en- 
coders (of the same general type as those obtained in the 
case of a 2-torus) are also the optimum solutions for more 
general curved manifolds. 

In order to demonstrate the various different basic 
types of self-organisation it is necessary to use synthetic 
training data with controlled properties. All of the types 
of self-organisation that will be demonstrated in this pa- 
per may be obtained by training a 1-stage or 2-stage en- 
coder on 24-dimensional data (i.e. M = 24) that con- 
sists of a superposition of a pair of identical objects (with 
circular wraparound to remove edge effects), such as is 
shown in figure 6. 

The training data is thus uniformly distributed on a 
manifold with 2 intrinsic circular coordinates, which is 
then embedded in a 24-dimensional image space. The 
embedding is a curved manifold, but is not a 2-torus,
and there are two reasons for this. Firstly, even though
the manifold has 2 intrinsic circular coordinates, the 
non-linear embedding distorts these circles in the 24- 
dimensional embedding space so that they are not pla- 
nar (i.e. the profile of each object lives in the full 24- 
dimensional embedding space). Secondly, unlike a 2- 
torus, each point on the manifold maps to itself under 
interchange of the pair of circular coordinates, so the 
manifold is covered twice by a 2-torus (i.e. the objects are 
identical, so it makes no difference if they are swapped 
over). However, these differences do not destroy the general
character of the joint and factorial encoder solutions
that were obtained in section II B.


In the simulations presented below, two different meth- 
ods of selecting the object positions are used: either the 
positions are statistically independent, or they are corre- 
lated. In the independent case, each object position is a 
random integer in the interval [1,24]. In the correlated 
case, the first object position is a random integer in the 
interval [1,24], and the second object position is chosen 
relative to the first one as an integer in the range [4, 8], 
so that the mean object separation is 6 units. 
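
A minimal sketch of how such training vectors might be generated is given below. Several details are assumptions made only for illustration: the "half-width" of 1.5 units is treated as the standard deviation of the Gaussian, the correlated offset is drawn uniformly from the integers 4 to 8, and the names `hump` and `training_vector` are hypothetical.

```python
import math
import random

DIM = 24          # dimensionality of the input vector
HALF_WIDTH = 1.5  # treated here as the Gaussian standard deviation (an assumption)

def hump(centre, dim=DIM, width=HALF_WIDTH):
    """Gaussian hump of peak amplitude 1, wrapped around circularly."""
    out = []
    for k in range(1, dim + 1):
        d = abs(k - centre) % dim
        d = min(d, dim - d)  # shortest distance around the circle
        out.append(math.exp(-0.5 * (d / width) ** 2))
    return out

def training_vector(rng, correlated):
    """Superpose two identical objects at independent or correlated positions."""
    p1 = rng.randint(1, DIM)
    if correlated:
        # second object sits 4 to 8 units beyond the first, wrapped into [1, DIM]
        p2 = (p1 - 1 + rng.randint(4, 8)) % DIM + 1
    else:
        p2 = rng.randint(1, DIM)
    return [a + c for a, c in zip(hump(p1), hump(p2))], (p1, p2)

rng = random.Random(0)
x, (p1, p2) = training_vector(rng, correlated=True)
```

With `correlated=False` the two positions are drawn independently, which corresponds to the independent case described above.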


C. Independent Objects 


The simplest demonstration is to let a single stage en- 
coder discover the fact that the training data consists of 
a superposition of a pair of objects, which is an example 
of independent component analysis (ICA) or blind signal
separation (BSS) [4]. This may readily be done by
setting the parameter values as follows: code book size
M = 16, number of code indices sampled n = 20, ε = 0.2
for 250 training steps, ε = 0.1 for a further 250 training
steps.


The self-organisation of the 16 reconstruction vectors 
as training progresses (measured down the page) is shown 
in figure 7. 


After some initial confusion, the reconstruction vec- 
tors self-organise so that each code index corresponds to 
a single object at a well defined location. This behaviour 
is non-trivial, because each training vector is a superpo- 
sition of a pair of objects at independent locations, so 
typically more than one code index must be sampled by 
the encoder, which is made possible by the relatively large 
choice n = 20. This result is a factorial encoder, because 
the objects are encoded separately. This is a rudimen- 
tary example of the type of solution that was illustrated 
in figure 2, although here the blocks overlap each other. 


The case of a joint encoder requires a rather large code 
book when the objects are independent. However, when 
correlations between the objects are introduced then the 
code book can be reduced to a manageable size, as will 
be demonstrated in the next section.

Figure 7: A factorial encoder emerges when a single stage
encoder is trained on data that is a superposition of 2 objects
in independent locations.


D. Correlated Objects 


A more interesting situation arises if the positions of 
the pair of objects are mutually correlated, so that the 
training data is non-uniformly distributed on a manifold 
with 2 intrinsic circular coordinates. The pair of objects 
can then be encoded in 3 fundamentally different ways: 


1. Factorial encoder. This encoder ignores the cor- 
relations between the objects, and encodes them 
as if they were 2 independent objects. Each code 
index would thus encode a single object position, 
so many code indices must be sampled in order to 
virtually guarantee that both object positions are 
encoded. This result is a type of independent com- 
ponent analysis (ICA) [4]. 


2. Joint encoder. This encoder regards each possible 
joint placement of the 2 objects as a distinct config- 
uration. Each code index would thus encode a pair 
of object positions, so only one code index needs 
to be sampled in order to guarantee that both ob- 
ject positions are encoded. This result is basically 
the same as what would be obtained by using a 
standard VQ [7]. 


3. Invariant encoder. This encoder regards each pos- 
sible placement of the centroid of the 2 objects 
as a distinct configuration, but regards all possi- 
ble object separations (for a given centroid) as be- 
ing equivalent. Each code index would thus encode 
only the centroid of the pair of objects. This type 
of encoder does not arise when the objects are in- 
dependent. This is similar to self-organising trans- 
formation invariant detectors described in [12]. 


Each of these 3 possibilities is shown in figure 8, where 
the diagrams are meant only to be illustrative. The cor- 
related variables live in the large 2-dimensional rectan- 
gular region extending from bottom-left to top-right of 
each diagram. For data of the type shown in figure 6, 
the rectangular region is in reality the curved manifold 
generated by varying the pair of object coordinates, and 
the invariance of the data under interchange of the pair 
of object coordinates means that the upper left and lower
right halves of each diagram cover the manifold twice. 

The factorial encoder has two orthogonal sets of long 
thin rectangular code cells, and the diagram shows how 
a pair of such cells intersect to define a small square code 
cell. The joint encoder behaves as a standard vector 
quantiser, and is illustrated as having a set of square code 
cells, although their shapes will not be as simple as this 
in practice. The invariant encoder has a set of long thin 
rectangular code cells that encode only the long diagonal 
dimension. 

In all 3 cases there is overlap between code cells. In 
the case of the factorial and joint encoders the overlap 
tends to be only between nearby code cells, whereas in 
the case of an invariant encoder the range of the overlap 
is usually much greater, as will be seen in the numerical 
simulations below. In practice the optimum encoder may 
not be a clean example of one of the types illustrated in 
figure 8, as will also be seen in the numerical simulations 
below. 


1. Factorial Encoding 


A factorial encoder may be trained by setting the parameter
values as follows: code book size M = 16, number
of code indices sampled n = 20, ε = 0.2 for 500 training
steps, ε = 0.1 for a further 500 training steps. This
is the same as in the case of independent objects, except 
that the number of training steps has been doubled. 

The result is shown in figure 9 which should be com- 
pared with the result for independent objects in figure 7. 
The presence of correlations degrades the quality of this 
factorial code relative to the case of independent objects. 
The contamination of the factorial code takes the form 
of a few code indices which respond jointly to the pair of 
objects. 

The joint coding contamination of the factorial code 
can be reduced by using a 2-stage encoder, in which the 
second stage has the same values of M and n as the 
first stage (although identical parameter values are not 
necessary), and (in this case) both stages have the same 
weighting in the objective function (see equation 2.10). 

The results are shown in figure 10. The reason that
the second stage encourages the first to adopt a pure
factorial code is quite subtle. The result shown in figure
10 will lead to the first stage producing an output
in which 2 code indices (one for each object) typically
have probability 1/2 of being sampled, and all of the
remaining code indices have a very small probability (this
is an approximation which ignores the fact that the code
cells overlap). On the other hand, figure 9 will lead to
an output in which the probability is sometimes concentrated
on a single code index. However, the contribution
of the second stage to the overall objective function
encourages it to encode the vector of probabilities output
by the first stage with minimum Euclidean reconstruction
error, which is easier to do if the situation is as in
figure 10 rather than as in figure 9. In effect, the second
stage likes to see an output from the first stage in which
a large number of code indices are each sampled with a
low probability, which favours factorial coding over joint
encoding.

Figure 8: Three alternative ways of using 30 code indices to
encode a pair of correlated variables. The typical code cells
are shown in bold.

Figure 9: A factorial encoder emerges when a single stage
encoder is trained on data that is a superposition of 2 objects
in correlated locations.

Figure 10: The factorial encoder is improved, by the removal
of the joint encoding contamination, when a 2-stage encoder
is used.

Figure 11: A joint encoder emerges when a single stage encoder
is trained on data that is a superposition of 2 objects
in correlated locations.


2. Joint Encoding 


A joint encoder may be trained by setting the parameter
values as follows: code book size M = 16, number
of code indices sampled n = 3, ε = 0.2 for 500 training
steps, ε = 0.1 for a further 500 training steps, ε = 0.05
for a further 1000 training steps. This is the same as
the parameter values for the factorial encoder above, ex- 
cept that n has been reduced to n = 3, and the training 
schedule has been extended. 

The result is shown in figure 11. After some initial 
confusion, the reconstruction vectors self-organise so that 
each code index corresponds to a pair of objects at well 
defined locations, so the code index jointly encodes the 
pair of object positions; this is a joint encoder. The small 
value of n prevents a factorial encoder from emerging. 


3. Invariant Encoding


An invariant encoder may be trained by using a 2-stage
encoder, and setting the parameter values identically in
each stage as follows (where the weighting of the second
stage relative to the first is denoted as s): code book size
M = 16, number of code indices sampled n = 3, ε = 0.2
and s = 5 for 500 training steps, ε = 0.1 and s = 10
for a further 500 training steps, ε = 0.05 and s = 20 for
a further 500 training steps, ε = 0.05 and s = 40 for a
further 500 training steps. This is basically the same as
the parameter values used for the joint encoder above,
except that there are now 2 stages, and the weighting
of the second stage is progressively increased throughout
the training schedule. Note that the large value that is
used for s is offset to a certain extent by the fact that
the ratio of the normalisation of the inputs to the first
and second stages is very large; the anomalous normalisation
of the input to the first stage could be removed
by insisting that the input to the first stage is a vector of
probabilities, but that is not done in these simulations.

Figure 12: An invariant encoder emerges when a 2-stage encoder
is trained on data that is a superposition of 2 objects
in correlated locations.
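
The piecewise-constant training schedule quoted above for the invariant encoder can be written down compactly as a table of phases. The sketch below is illustrative only; the representation as a list of tuples and the helper `settings_at` are assumptions, not part of the original simulations.

```python
# (step size eps, second-stage weighting s, number of training steps) for each phase
INVARIANT_SCHEDULE = [
    (0.2, 5.0, 500),
    (0.1, 10.0, 500),
    (0.05, 20.0, 500),
    (0.05, 40.0, 500),
]

def settings_at(step, schedule):
    """Return the (eps, s) pair in force at a 0-based training step."""
    start = 0
    for eps, s, length in schedule:
        if step < start + length:
            return eps, s
        start += length
    raise IndexError("step lies beyond the end of the schedule")

total_steps = sum(length for _, _, length in INVARIANT_SCHEDULE)
```

A training loop would call `settings_at(step, INVARIANT_SCHEDULE)` once per update, so that ε shrinks while the second-stage weighting s grows as training proceeds.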

The result is shown in figure 12. During the early 
part of the training schedule the weighting of the sec- 
ond stage is still relatively small, so it has the effect of 
turning what would otherwise have been a joint encoder 
into a factorial encoder; this is analogous to the effect ob- 
served when figure 9 becomes figure 10. However, as the 
training schedule progresses the weighting of the second 
stage increases further, and the reconstruction vectors 
self-organise so that each code index corresponds to a 
pair of objects with a well defined centroid but indeter- 
minate separation. Thus each code index encodes only 
the centroid of the pair of objects and ignores their sep- 
aration. This is a new type of encoder that arises when 
the objects are correlated, and it will be called an invari- 
ant encoder, in recognition of the fact that its output is 
invariant with respect to the separation of the objects. 

Note that in these results there is a large amount of 
overlap between the code cells, which should be taken 
into account when interpreting the illustration in figure 
8. 


IV. CONCLUSIONS 


The numerical results presented in this paper show
that a stochastic vector quantiser (VQ) can be trained
to find a variety of different ways of encoding
high-dimensional input vectors. These input vectors are
generated in two stages. Firstly, a low-dimensional manifold
is created whose intrinsic coordinates are the positions
of the objects in the scene; this corresponds to
generating the scene itself. Secondly, this manifold is
non-linearly embedded to create a curved manifold that
lives in a high-dimensional space of image pixel values;
this corresponds to imaging the generated scene.

Three fundamentally different types of encoder have 
been demonstrated, which differ in the way that they 
build a reconstruction that approximates the input vec- 
tor: 


1. A factorial encoder uses a reconstruction that is a
superposition of a number of vectors that each live
in a well defined input subspace, which is useful
for discovering constituent objects in the input vec- 
tor. This result is a type of independent component 
analysis (ICA) [4]. 


2. A joint encoder uses a reconstruction that is a sin- 
gle vector that lives in the whole input space. This 
result is basically the same as what would be ob- 
tained by using a standard VQ [7]. 


3. An invariant encoder uses a reconstruction that is 
a single vector that lives in a subspace of the whole 
input space, so it ignores some dimensions of the 
input vector, which is therefore useful for discover- 
ing correlated objects whilst rejecting uninteresting 
fluctuations in their relative coordinates. This is 
similar to self-organising transformation invariant 
detectors described in [12]. 


More generally, the encoder will be a hybrid of these basic 
types, depending on the interplay between the statistical 
properties of the input vector and the parameter settings 
of the SVQ.


Appendix A: Derivatives of the Objective Function


In order to minimise D_1 + D_2 it is necessary to compute
its derivatives. The derivatives were presented in
detail in [9] for a single stage chain (i.e. a single FMC).
The purpose of this appendix is to extend this derivation
to a multi-stage chain of linked FMCs. In order
to write the various expressions compactly, infinitesimal
variations will be used throughout this appendix, so
that $\delta(uv) = \delta u \, v + u \, \delta v$ will be written rather than
$\frac{\partial (uv)}{\partial \theta} = \frac{\partial u}{\partial \theta} v + u \frac{\partial v}{\partial \theta}$
(for some parameter θ). The calculation
will be done in a top-down fashion, differentiating
the objective function first, then differentiating anything
that the objective function depends on, and so on following
the dependencies down until only constants are left
(this is essentially the chain rule of differentiation).

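The derivative of D_1 + D_2 (defined in equation 2.10)
is given by


\[
\delta \sum_{l=1}^{L} s^{(l)} \left( D_1^{(l)} + D_2^{(l)} \right)
= \sum_{l=1}^{L} s^{(l)} \left( \delta D_1^{(l)} + \delta D_2^{(l)} \right)
\tag{A1}
\]


The derivatives of the D_1^(l) and D_2^(l) parts (defined in
equation 2.7, with appropriate (l) superscripts added)
of the contribution of stage l to D_1 + D_2 are given by
(dropping the (l) superscripts again, for clarity)


\[
\begin{aligned}
\delta D_1 &= \frac{2}{n} \int dx \, \Pr(x) \sum_{y=1}^{M} \left( \delta \Pr(y|x) \, \| x - x'(y) \|^2 + 2 \Pr(y|x) \, \left( \delta x - \delta x'(y) \right) \cdot \left( x - x'(y) \right) \right) \\
\delta D_2 &= \frac{4(n-1)}{n} \int dx \, \Pr(x) \left( \delta x - \sum_{y'=1}^{M} \left( \delta \Pr(y'|x) \, x'(y') + \Pr(y'|x) \, \delta x'(y') \right) \right) \cdot \left( x - \sum_{y'=1}^{M} \Pr(y'|x) \, x'(y') \right)
\end{aligned}
\tag{A2}
\]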

In numerical simulations the exact piecewise linear solu- 
tion for the optimum Pr (y|x) (see section IIC) will not 
be sought, rather Pr (y|x) will be modelled using a simple 
parametric form, and then the parameters will be opti- 
mised. This model of Pr (y|x) will not in general include 
the ideal piecewise linear optimum solution, so using it 
amounts to replacing D_1 + D_2, which is an upper bound
on the objective function D (see equation 2.10), by an 
even weaker upper bound on D. The justification for 
using this approach rests on the quality of the results 
that are obtained from the resulting numerical simula- 
tions (see section III). 

The first step in modelling Pr(y|x) is to explicitly state
the fact that it is a probability, which is a normalised
quantity. This may be done as follows


\[
\Pr(y|x) = \frac{Q(y|x)}{\sum_{y'=1}^{M} Q(y'|x)}
\tag{A3}
\]


where Q(y|x) > 0 (note that there is a slight change
of notation compared with [9], because Q(y|x) rather
than Q(x|y) is written, but the results are equivalent).
The Q(y|x) are thus unnormalised probabilities, and
$\sum_{y'=1}^{M} Q(y'|x)$ is the normalisation factor. The derivative
of Pr(y|x) is given by


\[
\frac{\delta \Pr(y|x)}{\Pr(y|x)} = \frac{1}{Q(y|x)} \left( \delta Q(y|x) - \Pr(y|x) \sum_{y'=1}^{M} \delta Q(y'|x) \right)
\tag{A4}
\]

The second step in modelling Pr(y|x) is to introduce an explicit parametric form for Q(y|x). The following
sigmoidal function will be used in this paper


\[
Q(y|x) = \frac{1}{1 + \exp\left( -w(y) \cdot x - b(y) \right)}
\tag{A5}
\]


where w(y) is a weight vector and b(y) is a bias. The derivative of Q(y|x) is given by


\[
\delta Q(y|x) = Q(y|x) \left( 1 - Q(y|x) \right) \left( \delta w(y) \cdot x + w(y) \cdot \delta x + \delta b(y) \right)
\tag{A6}
\]
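
The variation formulae for the normalised sigmoid model can be checked numerically. The sketch below is a sanity check under stated assumptions rather than part of the original derivation: it perturbs only the biases b(y), predicts the resulting first-order change in Pr(y|x) from the analytic expressions, and compares it with the change obtained by direct recomputation. All names and numerical values are hypothetical.

```python
import math

def q_values(x, weights, biases):
    """Unnormalised probabilities Q(y|x): sigmoids of w(y).x + b(y)."""
    return [1.0 / (1.0 + math.exp(-(sum(wk * xk for wk, xk in zip(w, x)) + b)))
            for w, b in zip(weights, biases)]

def posterior(x, weights, biases):
    """Normalised probabilities Pr(y|x)."""
    q = q_values(x, weights, biases)
    total = sum(q)
    return [qi / total for qi in q]

def posterior_variation(x, weights, biases, d_biases):
    """First-order change in Pr(y|x) caused by a small change in the biases only."""
    q = q_values(x, weights, biases)
    p = posterior(x, weights, biases)
    dq = [qi * (1.0 - qi) * db for qi, db in zip(q, d_biases)]  # bias term of dQ
    dq_sum = sum(dq)
    return [pi / qi * (dqi - pi * dq_sum) for pi, qi, dqi in zip(p, q, dq)]

x = [0.5, -1.0]
weights = [[1.0, 2.0], [-0.5, 0.3], [0.2, 0.2]]
biases = [0.1, -0.2, 0.0]
d_biases = [1e-6, -2e-6, 3e-6]
predicted = posterior_variation(x, weights, biases, d_biases)
shifted = posterior(x, weights, [b + db for b, db in zip(biases, d_biases)])
actual = [s - p for s, p in zip(shifted, posterior(x, weights, biases))]
```

The components of `predicted` sum to zero (up to rounding), as they must for a variation of a normalised probability vector; the same check can be repeated for perturbations of w(y) by adding the corresponding term of the variation of Q(y|x).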


There are also δx derivatives in equation A2 and equation
A6. The δx derivative arises only in multi-stage chains
of FMCs, and because of the way in which stages of the
chain are linked together (see equation 2.9) it is equal to
the derivative of the vector of probabilities output by the
previous stage. Thus the δx derivative may be obtained
by following its dependencies back through the stages of
the chain until the first layer is reached; this is essentially
the chain rule of differentiation. This ensures that
for each stage the partial derivatives include the additional
contributions that arise from forward propagation
through later stages, as described in section III A.




There are also δx'(y) derivatives in equation A2, but


In this paper a stochastic generalisation of the standard Linde-Buzo-Gray (LBG) approach to
vector quantiser (VQ) design is presented, in which the encoder is implemented as the sampling 
of a vector of code indices from a probability distribution derived from the input vector, and the 
decoder is implemented as a superposition of reconstruction vectors. This stochastic VQ (SVQ) is 
optimised using a minimum mean Euclidean reconstruction distortion criterion, as in the LBG case. 
Numerical simulations are used to demonstrate how this leads to self-organisation of the SVQ, where 
different stochastically sampled code indices become associated with different input subspaces. 


I. INTRODUCTION 


In vector quantisation a code book is used to encode 
each input vector as a corresponding code index, which 
is then decoded (again, using the codebook) to produce 
an approximate reconstruction of the original input vec- 
tor [2, 3]. The purpose of this paper is to generalise the 
standard approach to vector quantiser (VQ) design [7], 
so that each input vector is encoded as a vector of code 
indices that are stochastically sampled from a probabil- 
ity distribution that depends on the input vector, rather 
than as a single code index that is the deterministic out- 
come of finding which entry in a code book is closest 
to the input vector. This will be called a stochastic 
VQ (SVQ), and it includes the standard VQ as a spe- 
cial case. Note that this approach is different from the 
various stochastic approaches that are used to train VQs
(see e.g. [12, 14, 15]), because here the codebook itself 
is stochastic, so the use of probability distributions is 
essential both during and after training. 


One advantage of using the stochastic approach, which 
will be demonstrated in this paper, is that it automates 
the process of splitting high-dimensional input vectors 
into low-dimensional blocks before encoding them, be- 
cause minimising the mean Euclidean reconstruction er- 
ror can encourage different stochastically sampled code 
indices to become associated with different input sub- 
spaces [11]. Another advantage is that it is very easy to 
connect SVQs together, by using the vector of code index 
probabilities computed by one SVQ as the input vector 
to another SVQ [10]. 


In Section II various pieces of previously published the- 
ory are unified to give a coherent account of SVQs. In 
Section III the results of some new numerical simulations
are presented, which demonstrate how the code indices in
an SVQ can become associated in various ways with input
subspaces. In the appendices various derivations relating
to the detailed training of an SVQ are presented.


II. THEORY 


In this section various pieces of previously published 
theory are unified to establish a coherent framework for 
modelling SVQs. In Section IIA the basic theory of 
folded Markov chains (FMC) is given [8], and in Section 
IIB it is extended to the case of high-dimensional input
data [9]. Finally, in Section IIC the theory is further 
generalised to chains of linked FMCs [10]. 


A. Folded Markov Chains 


The basic building block of the encoder/decoder model
used in this paper is the folded Markov chain (FMC) [8].
Thus an input vector x is encoded as a code index vector
y, which is then subsequently decoded as a reconstruction
x' of the input vector. Both the encoding and decoding
operations are allowed to be probabilistic, in the sense
that y is a sample drawn from Pr(y|x), and x' is a sample
drawn from Pr(x'|y), where Pr(y|x) and Pr(x|y) are
Bayes' inverses of each other, as given by
$\Pr(x|y) = \frac{\Pr(y|x) \Pr(x)}{\int dz \, \Pr(y|z) \Pr(z)}$,
and Pr(x) is the prior probability from
which x was sampled. Because the chain of dependences
in passing from x to y and then to x' is first order Markov
(i.e. it is described by the directed graph $x \longrightarrow y \longrightarrow x'$),
and because the two ends of this Markov chain (i.e.
x and x') live in the same vector space, it is called a
folded Markov chain (FMC). The operations that occur
in an FMC are summarised in Figure 1.


Figure 1: A folded Markov chain (FMC) in which an input
vector x is encoded as a code index vector y that is drawn
from a conditional probability Pr(y|x), which is then decoded
as a reconstruction vector x' drawn from the Bayes' inverse
conditional probability Pr(x'|y).


In order to ensure that the FMC encodes the input vector
optimally, a measure of the reconstruction error must
be minimised. There are many possible ways to define
this measure, but one that is consistent with many previous
results, and which also leads to many new results,
is the mean Euclidean reconstruction error measure D,
which is defined as


\[
D \equiv \int dx \, \Pr(x) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x) \int dx' \, \Pr(x'|y) \, \| x - x' \|^2
\tag{1}
\]


where $y = (y_1, y_2, \ldots, y_n)$, $1 \le y_i \le M$ is assumed,
Pr(x) Pr(y|x) Pr(x'|y) is the joint probability that the
FMC has state (x, y, x'), $\| x - x' \|^2$ is the Euclidean
reconstruction error, and
$\int dx \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \int dx' \, (\cdots)$
sums over all possible states of the FMC (weighted by
the joint probability).

The Bayes' inverse probability Pr(x'|y) may be integrated
out of this expression for D to yield


\[
D = 2 \int dx \, \Pr(x) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x) \, \| x - x'(y) \|^2
\tag{2}
\]


where the reconstruction vector x'(y) is defined as
$x'(y) \equiv \int dx \, \Pr(x|y) \, x$. Because of the quadratic form
of the objective function, it turns out that x'(y) may be
treated as a free parameter whose optimum value (i.e.
the solution of $\partial D / \partial x'(y) = 0$) is $\int dx \, \Pr(x|y) \, x$, as required.
It was shown in [8] that the standard VQ [7] and topographic
mappings [5] automatically emerge as special
cases when D is minimised. In this approach, topographic
mappings emerge as the optimal coding scheme
when the code is to be transmitted along a noisy communication
channel before being decoded [1, 6].


B. High Dimensional Input Spaces


A problem with the standard VQ is that its code book
grows exponentially in size as the dimensionality of the
input vector is increased, assuming that the contribution
to the reconstruction error from each input dimension is
held constant. This means that such VQs are useless for 
encoding extremely high dimensional input vectors, such 
as images. The usual solution to this problem is to man- 
ually partition the input space into a number of lower 
dimensional subspaces, and then to encode each of these 
subspaces separately. However, it would be very useful 
if this partitioning could be done automatically, in such 
a way that typically the correlations within each sub- 
space were much stronger than the correlations between 
subspaces, so that the subspaces were approximately sta- 
tistically independent of each other. The purpose of this 
paper is to present a solution to this problem. 

The key step in solving this problem is to constrain 
the minimisation of D in such a way as to encourage the 
formation of code schemes in which each component of 
the code vector y codes a different subspace of the input 
vector x. There are two related constraints that may be 
imposed on Pr(y|x) and x’(y) which may be summarised 
as 


$$
\begin{aligned}
\Pr(y|x) &= \Pr(y_1|x) \, \Pr(y_2|x) \cdots \Pr(y_n|x) \\
x'(y) &= \frac{1}{n} \sum_{i=1}^{n} x'(y_i)
\end{aligned}
\tag{3}
$$

Thus each component $y_i$ (for $i = 1, 2, \cdots, n$ and
$1 \le y_i \le M$) is an independent sample drawn from the
codebook using $\Pr(y_i|x)$ (which is assumed to be the same
function for all $i$), and the reconstruction vector x’(y)
(vector argument) is assumed to be a superposition of n
contributions $x'(y_i)$ (scalar argument) for $i = 1, 2, \cdots, n$.
Taken together, these constraints encourage the formation
of coding schemes in which independent subspaces
are separately coded, as required.

Figure 2: A chain of linked FMCs, in which the output from each stage is its vector of posterior probabilities (for all values of
the code index), which is then used as the input to the next stage. Only 3 stages are shown, but any number may be used.
More generally, any acyclically linked network of FMCs may be used.

The constraints in Equation 3 prevent the full space of
possible values of Pr(y|x) or x’(y) from being explored as
D is minimised, so they lead to an upper bound $D_1 + D_2$
on the FMC objective function D (i.e. $D \le D_1 + D_2$),
which may be derived as [9]

$$
\begin{aligned}
D_1 &= \frac{2}{n} \int dx \, \Pr(x) \sum_{y=1}^{M} \Pr(y|x) \, \|x - x'(y)\|^2 \\
D_2 &= \frac{2(n-1)}{n} \int dx \, \Pr(x) \, \Big\| x - \sum_{y=1}^{M} \Pr(y|x) \, x'(y) \Big\|^2
\end{aligned}
\tag{4}
$$

Note that M (size of codebook) and n (number of sam- 
ples drawn from codebook using Pr(y|x)) are effectively 
model order parameters, whose values need to be cho- 
sen appropriately for each encoder optimisation problem. 
The properties of the optimum solution depend critically 
on the interplay between the statistical properties of the 
training data and the model order parameters M and n, 
as will be seen in the simulations in Section III. 
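
The relationship between D and $D_1 + D_2$ can be checked by brute force on a toy problem. The sketch below is illustrative only: the function name `constrained_fmc_objectives`, the toy posterior and the random parameters are assumptions, not part of the original paper. It enumerates all $M^n$ code index vectors for a reconstruction of the superposition form in Equation 3, for which the expression $2 \int dx \, \Pr(x) \sum_y \Pr(y|x) \|x - x'(y)\|^2$ evaluates exactly to $D_1 + D_2$; the inequality in the text arises because an unconstrained x’(y) could do better.

```python
import itertools
import numpy as np

def constrained_fmc_objectives(x_samples, posterior, recon, n):
    """Return (D, D1, D2) averaged over equally weighted samples of x.

    posterior(x) -> length-M vector Pr(y|x); recon is an (M, dim) array
    whose rows are the reconstruction vectors x'(y).
    """
    M = recon.shape[0]
    D = D1 = D2 = 0.0
    for x in x_samples:
        p = posterior(x)
        # D1 term: single-sample reconstruction error, weighted by 2/n.
        sq_err = np.sum((x - recon) ** 2, axis=1)
        D1 += (2.0 / n) * np.dot(p, sq_err)
        # D2 term: error of the posterior-mean reconstruction, weighted by 2(n-1)/n.
        mean_recon = p @ recon
        D2 += (2.0 * (n - 1) / n) * np.sum((x - mean_recon) ** 2)
        # D term: explicit sum over all M**n code index vectors, with
        # Pr(y|x) factorised and x'(y) averaged as in Equation 3.
        for y in itertools.product(range(M), repeat=n):
            idx = list(y)
            prob = np.prod(p[idx])
            x_rec = recon[idx].mean(axis=0)
            D += 2.0 * prob * np.sum((x - x_rec) ** 2)
    N = len(x_samples)
    return D / N, D1 / N, D2 / N

# Toy problem (all values are arbitrary assumptions).
rng = np.random.default_rng(0)
recon = rng.normal(size=(3, 2))      # M = 3 reconstruction vectors in 2 dimensions
weights = rng.normal(size=(3, 2))

def posterior(x):
    q = 1.0 / (1.0 + np.exp(-(weights @ x)))
    return q / q.sum()

xs = rng.normal(size=(5, 2))
D, D1, D2 = constrained_fmc_objectives(xs, posterior, recon, n=3)
print(D, D1 + D2)
```

The brute-force sum grows as $M^n$, which is why the two-term form $D_1 + D_2$ is the practical quantity to minimise.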


C. Chains of Linked FMCs 


The FMC illustrated in Figure 1 may be generalised to 
a chain of linked FMCs as shown in Figure 2. Each stage 
in this chain is an FMC of the type shown in Figure 
1, and the vector of probabilities (for all values of the 
code index) computed by each stage is used as the input 
vector to the next stage; there are other ways of linking 
the stages together, but this is the simplest possibility. 
The overall objective function is a weighted sum of the 
FMC objective functions derived from each stage. The
total number of free parameters in an L stage chain is
3L − 1, which is the sum of 2 free parameters for each of
the L stages, plus L − 1 weighting coefficients; there are
L − 1 rather than L weighting coefficients because the
overall normalisation of the objective function does not
affect the optimum solution.

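The chain of linked FMCs may be expressed mathematically
by first of all introducing an index $l$ to allow
different stages of the chain to be distinguished thus

$$
\begin{aligned}
M &\to M^{(l)} \\
x &\to x^{(l)} \\
y &\to y^{(l)} \\
x' &\to x'^{(l)} \\
D &\to D^{(l)} \\
D_1 &\to D_1^{(l)} \\
D_2 &\to D_2^{(l)}
\end{aligned}
\tag{5}
$$

The stages are then defined and linked together thus

$$
\begin{aligned}
x^{(l)} &\to y^{(l)} \to x'^{(l)} \\
x^{(l+1)} &= \left( x_1^{(l+1)}, x_2^{(l+1)}, \cdots, x_{M^{(l)}}^{(l+1)} \right) \\
x_i^{(l+1)} &= \Pr\left( y^{(l)} = i \,\middle|\, x^{(l)} \right), \quad 1 \le i \le M^{(l)}
\end{aligned}
\tag{6}
$$

The objective function and its upper bound are then
given by

$$
D = \sum_{l=1}^{L} s^{(l)} D^{(l)} \le D_1 + D_2 = \sum_{l=1}^{L} s^{(l)} \left( D_1^{(l)} + D_2^{(l)} \right) \tag{7}
$$

where $s^{(l)} \ge 0$ is the weighting that is applied to the
contribution of stage $l$ of the chain to the overall objective
function.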
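
The linking rule, in which the vector of posterior probabilities from one stage becomes the input vector of the next, can be sketched as a forward pass. This is a minimal illustration under stated assumptions: the helper names `posterior` and `chain_forward` are hypothetical, and the normalised sigmoid form of Pr(y|x) is borrowed from Appendix A.

```python
import numpy as np

def posterior(x, w, b):
    """Pr(y|x) for one stage, modelled as normalised sigmoids (Appendix A form)."""
    q = 1.0 / (1.0 + np.exp(-(w @ x) - b))
    return q / q.sum()

def chain_forward(x, stages):
    """Propagate x through a chain; `stages` is a list of (w, b) pairs.

    Returns the list of input vectors seen by each stage and the final
    vector of posterior probabilities.
    """
    inputs = []
    for w, b in stages:
        inputs.append(x)
        x = posterior(x, w, b)   # output of this stage is the next stage's input
    return inputs, x

# Example: a 2-stage chain with a 24-dimensional input and M = 16 in both stages.
rng = np.random.default_rng(0)
stages = [
    (0.1 * rng.normal(size=(16, 24)), np.zeros(16)),
    (0.1 * rng.normal(size=(16, 16)), np.zeros(16)),
]
inputs, out = chain_forward(rng.normal(size=24), stages)
```

Note that the input dimensionality of stage $l + 1$ is fixed by the code book size $M^{(l)}$ of stage $l$.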


III. SIMULATIONS 


In this section the results of various simulations are 
presented, which demonstrate some of the types of self- 
organising behaviour exhibited by an encoder that con- 
sists of a chain of linked FMCs. Synthetic, rather than 
real, training data are used in all of the simulations, 
because this allows the basic types of behaviour to be 
cleanly demonstrated. 

In Section III A the training data is described. In
Section III B a single stage encoder is trained on data that
is a superposition of two randomly positioned objects.
In Section III C this is generalised to objects with
correlated positions, and three different types of behaviour are
demonstrated: factorial encoding using both a 1-stage
and a 2-stage encoder (Section III D), joint encoding
using a 1-stage encoder (Section III E), and invariant
encoding using a 2-stage encoder (Section III F).

In Appendix A the derivatives of the objective function 
are derived, and in Appendix B a gradient descent train- 
ing algorithm based on these derivatives is presented. 


A. Training Data 


The key property that this type of self-organising
encoder exhibits is its ability to automatically split up
high-dimensional input spaces into lower-dimensional
subspaces, each of which is separately encoded. This
self-organisation manifests itself in many different ways,
depending on the interplay between the statistical
properties of the training data, and the 3 free parameters
(i.e. the code book size M, the number of code indices
sampled n, and the stage weighting s) per stage of the
encoder (see Section II C).

In order to demonstrate the various different basic
types of self-organisation it is necessary to use synthetic
training data with controlled properties. All of the types
of self-organisation that will be demonstrated in this
paper may be obtained by training a 1-stage or 2-stage
encoder on 24-dimensional data (i.e. M = 24) that
consists of a superposition of a pair of identical objects (with
circular wraparound to remove edge effects), such as is
shown in Figure 3.

Figure 3: An example of a typical training vector for M = 24.
Each object is a Gaussian hump with a half-width of 1.5 units,
and peak amplitude of 1. The overall input vector is formed
as a linear superposition of the 2 objects. Note that the input
vector is wrapped around circularly to remove minor edge
effects that would otherwise arise.


In the simulations presented below, two different
methods of selecting the object positions are used: either the
positions are statistically independent, or they are
correlated. In the independent case, each object position is a
random integer in the interval [1, 24]. In the correlated
case, the first object position is a random integer in the
interval [1, 24], and the second object position is chosen
relative to the first one as an integer in the range [4, 8],
so that the mean object separation is 6 units.
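
The training data described above can be generated with a few lines of code. The following is a sketch rather than the generator used for the published simulations. It makes three assumptions that the text does not settle: the "half-width" of 1.5 units is taken to be the Gaussian standard deviation, the relative offset in the correlated case is drawn uniformly from {4, ..., 8}, and the offset is added to the first position (with wraparound). The names `hump` and `training_vector` are hypothetical.

```python
import numpy as np

DIM = 24          # input dimensionality used in the simulations
HALF_WIDTH = 1.5  # ASSUMPTION: interpreted as the Gaussian standard deviation

def hump(position, dim=DIM, width=HALF_WIDTH):
    """Gaussian hump with peak amplitude 1 at `position`, wrapped circularly."""
    idx = np.arange(1, dim + 1)
    # Shortest circular distance between each component and the hump centre.
    d = np.abs(idx - position)
    d = np.minimum(d, dim - d)
    return np.exp(-0.5 * (d / width) ** 2)

def training_vector(rng, correlated):
    """Return (input vector, (first position, second position))."""
    first = int(rng.integers(1, DIM + 1))        # random integer in [1, 24]
    if correlated:
        offset = int(rng.integers(4, 9))         # ASSUMPTION: uniform on [4, 8]
        second = (first - 1 + offset) % DIM + 1  # wrap back into [1, 24]
    else:
        second = int(rng.integers(1, DIM + 1))
    return hump(first) + hump(second), (first, second)

rng = np.random.default_rng(0)
x_corr, positions = training_vector(rng, correlated=True)
x_ind, _ = training_vector(rng, correlated=False)
```

With a uniform offset on {4, ..., 8} the mean separation is 6 units, which matches the value quoted in the text.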


B. Independent Objects 


The simplest demonstration is to let a single stage
encoder discover the fact that the training data consists
of a superposition of a pair of objects, which is a type
of independent component analysis (ICA) [4]. This may
readily be done by setting the parameter values as
follows: code book size M = 16, number of code indices
sampled n = 20, ε = 0.2 for 500 training steps, ε = 0.1
for a further 500 training steps.


Figure 4: A factorial encoder emerges when a single stage 
encoder is trained on data that is a superposition of 2 objects 
in independent locations. 


The self-organisation of each of the 16 reconstruction 
vectors as training progresses (measured down the page) 
is shown in Figure 4. After some initial confusion, the re- 
construction vectors self-organise so that each code index 
corresponds to a single object at a well defined location, 
whose width automatically adjusts itself so that the M 
reconstruction vectors cover the whole input space. This 
behaviour is non-trivial, because each training vector is 
a superposition of a pair of objects at independent loca- 
tions, so two different code index values must be sampled 
by the encoder (assuming that the two objects are not 
at the same location); the relatively large choice n = 20 
ensures that it is highly likely that both code index val- 
ues will be amongst the n random samples [11]. This 
result is called a factorial encoder, because the objects 
are encoded separately. 
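
The effect of n on the chance of sampling both code indices can be made concrete with a small calculation. The sketch below assumes an idealised factorial code in which the two relevant code indices each have sampling probability 1/2 and code cell overlap is ignored; the function name `prob_both_sampled` is hypothetical.

```python
def prob_both_sampled(n, p1=0.5, p2=0.5):
    """Probability that two code indices with sampling probabilities p1 and p2
    each appear at least once among n independent samples."""
    miss1 = (1.0 - p1) ** n           # index 1 never sampled
    miss2 = (1.0 - p2) ** n           # index 2 never sampled
    miss_both = (1.0 - p1 - p2) ** n  # neither index sampled
    return 1.0 - miss1 - miss2 + miss_both  # inclusion-exclusion

print(prob_both_sampled(3))    # 0.75
print(prob_both_sampled(20))   # very close to 1
```

Under these assumptions n = 20 makes a miss extremely unlikely, whereas a small value such as n = 3 leaves a substantial chance that one of the two objects is not represented in the sampled code.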

The case of a joint encoder, where each code index
corresponds to a pair of objects at well defined locations, 
requires a rather large code book when the objects are 
independent. However, when correlations between the 
objects are introduced then the code book can be reduced 
to a manageable size, as will be demonstrated in the next 
section. 


C. Correlated Objects 


If the positions of the pair of objects are mutually cor- 
related, then they can be encoded in 3 fundamentally 
different ways: 


1. Factorial encoder. This encoder ignores the corre- 
lations between the objects, and encodes them as 
if they were 2 independent objects. Each code in- 
dex thus encodes a single object position, so many 
code indices must be sampled in order to virtually 
guarantee that both object positions are encoded 
[11]. This result is a type of independent compo- 
nent analysis (ICA) [4]. 


2. Joint encoder. This encoder regards each possible 
joint placement of the 2 objects as a distinct con- 
figuration. Each code index thus encodes a pair of 
object positions, so only one code index needs to 
be sampled in order to guarantee that both object 
positions are encoded [11]. This result is basically 
the same as what would be obtained by using a 
standard VQ [7]. 


3. Invariant encoder. This encoder regards each pos- 
sible placement of the centroid of the 2 objects as 
a distinct configuration, but regards all possible 
object separations (for a given centroid) as being 
equivalent. Each code index thus encodes only the 
centroid of the pair of objects. This type of encoder 
does not arise when the objects are independent. 
This is similar to self-organising transformation in- 
variant detectors described in [13]. 


[Figure 5 panel labels: invariant, factorial]


Figure 5: Three alternative ways of using 30 code indices to
encode a pair of correlated variables. The typical code cells
are shown in bold.


Each of these 3 possibilities is shown in Figure 5, where 
the diagrams are meant only to be illustrative. The cor- 
related variables live in the large 2-dimensional rectan- 
gular region extending from bottom-left to top-right of 
each diagram. 

The factorial encoder has two orthogonal sets of long
thin rectangular code cells, and the diagram shows how 
a pair of such cells intersect to define a small square code 
cell. The joint encoder behaves as a standard vector 
quantiser, and is illustrated as having a set of square 
code cells, although their shapes will not be as simple as 
this in practice. The invariant encoder ideally has a set 
of long thin rectangular code cells that encode only the 
long diagonal dimension. 

In all 3 cases there is overlap between code cells. In 
the case of the factorial and joint encoders the overlap 
tends to be only between nearby code cells, whereas in 
the case of an invariant encoder the range of the overlap 
is usually much greater, as will be seen in the numerical 
simulations below. In practice the optimum encoder may 
not be a clean example of one of the types illustrated in 
Figure 5, as will also be seen in the numerical simulations 
below. 


D. Factorial Encoding 


A factorial encoder may be trained by setting the
parameter values as follows: code book size M = 16,
number of code indices sampled n = 20, ε = 0.2 for 500
training steps, ε = 0.1 for a further 500 training steps.


Figure 6: A factorial encoder emerges when a single stage 
encoder is trained on data that is a superposition of 2 objects 
in correlated locations. 


The result is shown in Figure 6 which should be com- 
pared with the result for independent objects in Figure 4. 
The presence of correlations degrades the quality of this 
factorial code relative to the case of independent objects. 
The contamination of the factorial code takes the form 
of a few code indices which respond jointly to the pair of 
objects. 

The joint coding contamination of the factorial code
can be reduced by using a 2-stage encoder, in which the
second stage has the same values of M and n as the
first stage (although identical parameter values are not
necessary), and (in this case) both stages have the same
weighting in the objective function (see Equation 7).


Figure 7: The factorial encoder is improved, by the removal 
of the joint encoding contamination, when a 2-stage encoder 
is used. 


The results are shown in Figure 7. The reason that the 
second stage encourages the first to adopt a pure facto- 
rial code is quite subtle. The result shown in Figure 7 
will lead to the first stage producing an output in which
2 code indices (one for each object) each typically have
probability 1/2 of being sampled, and all of the remaining
code indices have a very small probability (this is an
approximation which ignores the fact that the code cells
overlap). On the other hand, Figure 6 will lead to an 
output in which the probability can be concentrated on 
a single code index, if it can jointly code the pair of ob- 
jects. However, the contribution of the second stage to 
the overall objective function encourages it to encode the 
vector of probabilities output by the first stage with mini- 
mum Euclidean reconstruction error, which is easier to do 
if the situation is as in Figure 7 rather than as in Figure 
6. In effect, the second stage likes to see an output from 
the first stage in which more than one code index has 
a significant probability of being sampled, which favours 
factorial coding over joint encoding. 


E. Joint Encoding 


A joint encoder may be trained by setting the
parameter values as follows: code book size M = 16, number
of code indices sampled n = 3, ε = 0.2 for 500 training
steps, ε = 0.1 for a further 500 training steps, ε = 0.05
for a further 1000 training steps. This is the same as
the parameter values for the factorial encoder above,
except that n has been reduced to n = 3, and the training
schedule has been extended.


Figure 8: A joint encoder emerges when a single stage en- 
coder is trained on data that is a superposition of 2 objects 
in correlated locations. 


The result is shown in Figure 8. After some initial 
confusion, the reconstruction vectors self-organise so that 
each code index corresponds to a pair of objects at well 
defined locations, so the code index jointly encodes the 
pair of object positions; this is a joint encoder. The small 
value of n prevents a factorial encoder from emerging [11]. 


F. Invariant Encoding 


An invariant encoder may be trained by using a 2-stage 
encoder, and setting the parameter values identically in 
each stage as follows (where the weighting of the second 
stage relative to the first is denoted as s): code book size
M = 16, number of code indices sampled n = 3, ε = 0.2
and s = 5 for 500 training steps, ε = 0.1 and s = 10
for a further 500 training steps, ε = 0.05 and s = 20 for
a further 500 training steps, ε = 0.05 and s = 40 for a
further 500 training steps. This is basically the same as
the parameter values used for the joint encoder above, 
except that there are now 2 stages, and the weighting 
of the second stage is progressively increased throughout 
the training schedule. Note that the large value that is 
used for s is offset to a certain extent by the fact that 
the ratio of the normalisation of the inputs to the first 
and second stages is very large; the anomalous normal- 
isation of the input to the first stage could be removed 
by insisting that the input to the first stage is a vector of 
probabilities, but that is not done in these simulations. 

The result is shown in Figure 9. During the early
part of the training schedule the weighting of the
second stage is still relatively small, so it has the effect of
turning what would otherwise have been a joint encoder
into a factorial encoder; this is analogous to the effect
observed when Figure 6 becomes Figure 7. However, as the
training schedule progresses the weighting of the second
stage increases further, and the reconstruction vectors
self-organise so that each code index corresponds to a
pair of objects with a well defined centroid but
indeterminate separation. Thus each code index encodes only
the centroid of the pair of objects and ignores their
separation. This is a new type of encoder that arises when
the objects are correlated, and it will be called an
invariant encoder, in recognition of the fact that its output is
invariant with respect to the separation of the objects.

Figure 9: An invariant encoder emerges when a 2-stage encoder
is trained on data that is a superposition of 2 objects in
correlated locations.

Note that in these results there is a large amount of
overlap between the code cells, which should be taken
into account when interpreting the illustration in Figure
5. This is an extreme example of the second stage preferring
an output from the first stage in which more than one
code index has a significant probability of being sampled;
the large amount of overlap between code cells means
that many code indices have a significant probability of
being sampled.


IV. CONCLUSIONS 


The numerical results presented in this paper show
that a stochastic vector quantiser (SVQ) can self-organise
to find a variety of different ways of encoding
high-dimensional input vectors. Three fundamentally different
types of encoder have been demonstrated, which differ in
the way that they build a reconstruction that
approximates the input vector:

1. A factorial encoder uses a reconstruction that is a
superposition of a number of vectors that each live in a
well defined input subspace, which is useful for
discovering constituent objects in the input vector. This result
is a type of independent component analysis (ICA) [4].

2. A joint encoder uses a reconstruction that is a single 
vector that lives in the whole input space. This result is 
basically the same as what would be obtained by using a 
standard VQ [7]. 

3. An invariant encoder uses a reconstruction that is a 
single vector that lives in a subspace of the whole input 
space, so it ignores some dimensions of the input vector, 
which is therefore useful for discovering correlated ob- 
jects whilst rejecting uninteresting fluctuations in their 
relative coordinates. This is similar to self-organising 
transformation invariant detectors described in [13]. 

More generally, the encoder will be a hybrid of these 
basic types, depending on the interplay between the sta- 
tistical properties of the input vector and the parameter 
settings of the SVQ. 


Appendix A: Derivatives of the Objective Function 


In order to minimise $D_1 + D_2$ it is necessary to
compute its derivatives. The derivatives were presented in
detail in [9] for a single stage chain (i.e. a single FMC).
The purpose of this appendix is to extend this
derivation to a multi-stage chain of linked FMCs. In order
to write the various expressions compactly, infinitesimal
variations will be used throughout this appendix,
so that $\delta(uv) = \delta u \, v + u \, \delta v$ will be written rather than
$\frac{\partial (uv)}{\partial \theta} = \frac{\partial u}{\partial \theta} v + u \frac{\partial v}{\partial \theta}$ (for some parameter $\theta$). The
calculation will be done in a top-down fashion, differentiating
the objective function first, then differentiating anything
that the objective function depends on, and so on
following the dependencies down until only constants are left
(this is essentially the chain rule of differentiation).

The derivative of $D_1 + D_2$ (defined in Equation 7) is
given by

$$
\delta \sum_{l=1}^{L} s^{(l)} \left( D_1^{(l)} + D_2^{(l)} \right) = \sum_{l=1}^{L} s^{(l)} \left( \delta D_1^{(l)} + \delta D_2^{(l)} \right) \tag{A1}
$$

The derivatives of the $D_1^{(l)}$ and $D_2^{(l)}$ parts (defined in
Equation 4, with appropriate $(l)$ superscripts added) of
the contribution of stage $l$ to $D_1 + D_2$ are given by
(dropping the $(l)$ superscripts again, for clarity)

$$
\begin{aligned}
\delta D_1 &= \frac{2}{n} \int dx \, \Pr(x) \sum_{y=1}^{M} \Big( \delta \Pr(y|x) \, \|x - x'(y)\|^2 + 2 \Pr(y|x) \, (\delta x - \delta x'(y)) \cdot (x - x'(y)) \Big) \\
\delta D_2 &= \frac{4(n-1)}{n} \int dx \, \Pr(x) \, \Big( \delta x - \sum_{y=1}^{M} \big( \delta \Pr(y|x) \, x'(y) + \Pr(y|x) \, \delta x'(y) \big) \Big) \cdot \Big( x - \sum_{y=1}^{M} \Pr(y|x) \, x'(y) \Big)
\end{aligned}
\tag{A2}
$$


The first step in modelling Pr(y|x) is to explicitly state the fact that it is a probability, which is a non-negative
normalised quantity. This may be done as follows

$$
\Pr(y|x) = \frac{Q(y|x)}{\sum_{y'=1}^{M} Q(y'|x)} \tag{A3}
$$

where $Q(y|x) \ge 0$. The Q(y|x) are unnormalised probabilities, and $\sum_{y'=1}^{M} Q(y'|x)$ is the normalisation factor. The
derivative of Pr(y|x) is given by

$$
\frac{\delta \Pr(y|x)}{\Pr(y|x)} = \frac{1}{Q(y|x)} \left( \delta Q(y|x) - \Pr(y|x) \sum_{y'=1}^{M} \delta Q(y'|x) \right) \tag{A4}
$$

The second step in modelling Pr(y|x) is to introduce an explicit parametric form for Q(y|x). The following
sigmoidal function will be used in this paper

$$
Q(y|x) = \frac{1}{1 + \exp\left( -w(y) \cdot x - b(y) \right)} \tag{A5}
$$

where w(y) is a weight vector and b(y) is a bias. The derivative of Q(y|x) is given by

$$
\delta Q(y|x) = Q(y|x) \, (1 - Q(y|x)) \, \left( \delta w(y) \cdot x + w(y) \cdot \delta x + \delta b(y) \right) \tag{A6}
$$
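
Equations A3 to A6 can be checked numerically against a finite difference. The sketch below is an illustration only: the function names `q_and_pr` and `delta_pr`, the array shapes and the random test values are assumptions. Rows of `w` hold the weight vectors w(y), and `dw`, `db`, `dx` are arbitrary small-variation directions.

```python
import numpy as np

def q_and_pr(x, w, b):
    """Unnormalised sigmoids Q(y|x) (Equation A5) and posterior Pr(y|x) (Equation A3)."""
    q = 1.0 / (1.0 + np.exp(-(w @ x) - b))
    return q, q / q.sum()

def delta_pr(x, w, b, dw, db, dx):
    """First-order variation of Pr(y|x) obtained from Equations A4 and A6."""
    q, pr = q_and_pr(x, w, b)
    dq = q * (1.0 - q) * (dw @ x + w @ dx + db)   # Equation A6
    return pr / q * (dq - pr * dq.sum())          # Equation A4

# Compare with a forward finite difference along the direction (dw, db, dx).
rng = np.random.default_rng(1)
x, dx = rng.normal(size=4), rng.normal(size=4)
w, dw = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
b, db = rng.normal(size=3), rng.normal(size=3)

h = 1e-6
_, p0 = q_and_pr(x, w, b)
_, p1 = q_and_pr(x + h * dx, w + h * dw, b + h * db)
finite_difference = (p1 - p0) / h
analytic = delta_pr(x, w, b, dw, db, dx)
print(np.max(np.abs(finite_difference - analytic)))
```

Because Pr(y|x) is normalised, the components of the variation sum to zero, which provides a second quick consistency check.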

This has reduced the $\delta D_1$ and $\delta D_2$ derivatives to $\delta w(y)$,
$\delta b(y)$, $\delta x'(y)$ and $\delta x$ derivatives. The $\delta w(y)$, $\delta b(y)$ and
$\delta x'(y)$ derivatives relate directly to the parameters being
optimised and thus need no further simplification,
however the $\delta x$ derivatives in Equation A2 and Equation A6
need some further attention. The $\delta x$ derivative arises
only in multi-stage chains of FMCs, and because of the
way in which stages of the chain are linked together (see
Equation 6) it is equal to the derivative of the vector
of probabilities output by the previous stage. Thus the
$\delta x$ derivative may be obtained by following its
dependencies back through the stages of the chain until the
first layer is reached; this is essentially the chain rule of
differentiation. This ensures that for each stage the
partial derivatives include the additional contributions that
arise from forward propagation through later stages, as
described in Appendix B.


Appendix B: Training Algorithm


Assuming that Pr(y|x) is modelled as in Appendix
A (i.e. $\Pr(y|x) = \frac{Q(y|x)}{\sum_{y'=1}^{M} Q(y'|x)}$ and
$Q(y|x) = \frac{1}{1 + \exp(-w(y) \cdot x - b(y))}$)
then the partial derivatives of $D_1 + D_2$
with respect to the 3 types of parameters in a single
stage of the encoder may be denoted as

$$
\begin{aligned}
g_w(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial w(y)} \\
g_b(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial b(y)} \\
g_x(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial x'(y)}
\end{aligned}
\tag{B1}
$$

This may be generalised to each stage of a multi-stage
encoder by including an $(l)$ superscript, and ensuring that
for each stage the partial derivatives include the
additional contributions that arise from forward propagation
through later stages; this is essentially an application
of the chain rule of differentiation, using the derivatives
$\frac{\partial x^{(l+1)}}{\partial w^{(l)}(y^{(l)})}$ and
$\frac{\partial x^{(l+1)}}{\partial b^{(l)}(y^{(l)})}$ to link the stages together (see
Appendix A).

A simple algorithm for updating these parameters is
(omitting the $(l)$ superscript, for clarity)

$$
\begin{aligned}
w(y) &\to w(y) - \epsilon \, \frac{g_w(y)}{g_{w,0}} \\
b(y) &\to b(y) - \epsilon \, \frac{g_b(y)}{g_{b,0}} \\
x'(y) &\to x'(y) - \epsilon \, \frac{g_x(y)}{g_{x,0}}
\end{aligned}
\tag{B2}
$$

where ε is a small update step size parameter, and the
three normalisation factors are defined as

$$
\begin{aligned}
g_{w,0} &\equiv \max_y \sqrt{\frac{\|g_w(y)\|^2}{\dim x}} \\
g_{b,0} &\equiv \max_y |g_b(y)| \\
g_{x,0} &\equiv \max_y \sqrt{\frac{\|g_x(y)\|^2}{\dim x}}
\end{aligned}
\tag{B3}
$$

The $g_w(y)/g_{w,0}$ and $g_x(y)/g_{x,0}$ factors ensure that the maximum


Invariant Stochastic Encoders

S P Luttrell
DERA, St Andrews Road, Malvern, Worcs, WR14 3PS, UK 


The theory of stochastic vector quantisers (SVQ) has been extended to allow the quantiser to 
develop invariances, so that only “large” degrees of freedom in the input vector are represented in 
the code. This has been applied to the problem of encoding data vectors which are a superposition 
of a “large” jammer and a “small” signal, so that only the jammer is represented in the code. 
This allows the jammer to be subtracted from the total input vector (i.e. the jammer is nulled), 
leaving a residual that contains only the underlying signal. The main advantage of this approach 
to jammer nulling is that little prior knowledge of the jammer is assumed, because these properties 
are automatically discovered by the SVQ as it is trained on examples of input vectors. 


I. INTRODUCTION 


In vector quantisation a code book is used to encode 
each input vector as a corresponding code index, which is 
then decoded (again, using the codebook) to produce an 
approximate reconstruction of the original input vector 
[3, 4]. The standard approach to vector quantiser (VQ) 
design [5] may be generalised [7] so that each input vector 
is encoded as a vector of code indices that are stochas- 
tically sampled from a probability distribution that de- 
pends on the input vector, rather than as a single code 
index that is the deterministic outcome of finding which 
entry in a code book is closest to the input vector. This 
will be called a stochastic VQ (SVQ), and it includes the 
standard VQ as a special case. 


One advantage of using the stochastic approach is that 
it automates the process of splitting high-dimensional 
input vectors into low-dimensional blocks before encod- 
ing them, because minimising the mean Euclidean re- 
construction error can encourage different stochastically 
sampled code indices to become associated with different 
input subspaces [8, 10]. Another advantage is that it is 
very easy to connect SVQs together, by using the vector 
of code index probabilities computed by one SVQ as the 
input vector to another SVQ [9]. 


SVQ theory will be extended to the case of encoding 
noisy (or distorted) data, with the intention of subse- 
quently reconstructing an approximation to the noiseless 
data. This theory is then applied to the problem of en- 
coding data vectors which are a superposition of a “large” 
jammer and a “small” signal, where the signal is regarded 
as a distortion superimposed on the jammer, rather than 
the other way around. The reconstruction is then an ap- 
proximation to the jammer, which can thus be subtracted 
from the original data to reveal the underlying signal of 
interest.

In Section II the underlying theory of SVQs is devel- 
oped together with its extension to the encoding of noisy 
data, and in Section III some simulations illustrating the 
application of SVQs to the nulling of jammers are pre- 
sented. 


II. STOCHASTIC VECTOR QUANTISER 
THEORY 


In Section II A the basic theory of folded Markov chains
(FMC) is given [6], in Section II B FMC theory is
extended to the case of encoding noisy or distorted data
with the intention of eventually recovering the
undistorted data, in Section II C this extended theory is
applied to the problem of encoding data that contain
unwanted “nuisance degrees of freedom”, in Section II D
some constraints (including the threshold trick of [11])
on the optimisation of the encoder are introduced to
encourage the encoder to disregard the nuisance degrees of
freedom (i.e. discover invariances), and finally in Section
II E this invariant encoder theory is applied to the
problem of encoding and subsequently nulling “large” jammers
that obscure “small” signals.


A. Folded Markov Chains 


The basic building block of the SVQ used in this paper
is the folded Markov chain (FMC) [6]. An input vector
x is encoded as a code index vector y, which is then
subsequently decoded as a reconstruction x’ of the input
vector. Both the encoding and decoding operations are
allowed to be probabilistic, in the sense that y is a
sample drawn from Pr(y|x), and x’ is a sample drawn from
Pr(x’|y), where Pr(y|x) and Pr(x’|y) are Bayes’ inverses
of each other, as given by
$\Pr(x|y) = \frac{\Pr(y|x) \, \Pr(x)}{\int dz \, \Pr(y|z) \, \Pr(z)}$,
and Pr(x) is the prior probability from which x is
sampled.

In order to ensure that the FMC encodes the input vector
optimally, a measure of the reconstruction error must
be minimised. There are many possible ways to define
this measure, but one that is consistent with many previous
results, and which also leads to many new results,
is the mean Euclidean reconstruction error measure D,
which is defined as [6]

$$
D = \int dx \, \Pr(x) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x) \int dx' \, \Pr(x'|y) \, \|x - x'\|^2 \tag{1}
$$

where $y = (y_1, y_2, \cdots, y_n)$, $1 \le y_i \le M$ is assumed,
$\Pr(x) \Pr(y|x) \Pr(x'|y)$ is the joint probability that the
FMC has state $(x, y, x')$, $\|x - x'\|^2$ is the Euclidean
reconstruction error, and
$\int dx \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \int dx' (\cdots)$
sums over all possible states of the FMC (weighted by
the joint probability).

Figure 1: A folded Markov chain (FMC) in which an input
vector x is encoded as a code index vector y that is drawn
from a conditional probability Pr(y|x), which is then decoded
as a reconstruction vector x’ drawn from the Bayes’ inverse
conditional probability Pr(x’|y).

The Bayes’ inverse probability Pr(x’|y) may be
integrated out of this expression for D to yield [6]

$$
D = 2 \int dx \, \Pr(x) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x) \, \|x - x'(y)\|^2 \tag{2}
$$

where the reconstruction vector x’(y) is defined as
$x'(y) = \int dx \, \Pr(x|y) \, x$. Because of the quadratic form
of the objective function, it turns out that x’(y) may
be treated as a free parameter whose optimum value (i.e.
the solution of $\partial D / \partial x'(y) = 0$) is
$\int dx \, \Pr(x|y) \, x$, as required.


B. Noisy Data


The FMC approach can be generalised to the problem
of encoding noisy or distorted data, with the intention
of eventually recovering the undistorted data. This
generalisation is based on the results reported in [2]. The
input vector is $x_0$, which is converted into the distorted
input vector x by a distortion process $\Pr(x|x_0)$, which
is then encoded as a code index vector y, which is then
subsequently decoded as a reconstruction $x_0'$ of the
original input vector. This is described by the directed graph
$x_0 \to x \to y \to x_0'$. The operations that occur are
summarised in Figure 2.


Figure 2: A folded Markov chain (FMC) in which an input
vector $x_0$ is first distorted into x, which is then encoded as
a code index vector y that is drawn from a conditional
probability Pr(y|x), which is then decoded as a reconstruction
vector $x_0'$ drawn from the Bayes’ inverse conditional
probability $\Pr(x_0'|y)$.


The mean Euclidean reconstruction error measure D
becomes (compare Equation 1)

$$
D = \int dx_0 \, \Pr(x_0) \int dx \, \Pr(x|x_0) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x) \int dx_0' \, \Pr(x_0'|y) \, \|x_0 - x_0'\|^2 \tag{3}
$$

The Bayes’ inverse probability $\Pr(x_0'|y)$ may be integrated out of this expression for D to yield (compare Equation
2)

$$
D = 2 \int dx_0 \, \Pr(x_0) \int dx \, \Pr(x|x_0) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x) \, \|x_0 - x_0'(y)\|^2 \tag{4}
$$

where the reconstruction vector $x_0'(y)$ is defined as $x_0'(y) = \int dx_0 \, \Pr(x_0|y) \, x_0$, which may be treated as a free
parameter.

Bayes’ theorem $\Pr(x_0) \Pr(x|x_0) = \Pr(x) \Pr(x_0|x)$ may be used to integrate out $x_0$ to yield

$$
D = 2 \int dx \, \Pr(x) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x) \, \|x_0(x) - x_0'(y)\|^2 + \text{constant} \tag{5}
$$

where $x_0(x)$ is defined as $x_0(x) = \int dx_0 \, \Pr(x_0|x) \, x_0$.

It is much more difficult to optimise this version of
the objective function than the version in Equation 2,
because the $x_0(x)$ term is in general a non-linear
function of x. Worse still, the expression for $x_0(x)$ involves
$\Pr(x_0|x)$, which depends on the unknown $\Pr(x_0)$, so
$x_0(x)$ cannot be computed analytically anyway. The
situation looks irretrievable, but it turns out that some
progress can be made by conceptually splitting x into
“signal” and “noise” subspaces, as will be shown in
Section II C.
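
The step that leads to Equation 5, in which $x_0$ is integrated out and leaves a constant that does not depend on the encoder, can be checked on a small discrete example. The sketch below is illustrative only and makes simplifying assumptions: x takes a finite number of values, a single code index is sampled (n = 1), and all distributions and vectors are random toy values with hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
K, J, M, dim = 5, 4, 3, 2                     # toy sizes (assumptions)

x0_vals = rng.normal(size=(K, dim))           # possible undistorted vectors x0
joint = rng.random((K, J))
joint /= joint.sum()                          # joint probability Pr(x0, x), discrete x
enc = rng.random((J, M))
enc /= enc.sum(axis=1, keepdims=True)         # encoder Pr(y|x)
rec = rng.normal(size=(M, dim))               # arbitrary reconstruction vectors x0'(y)

p_x = joint.sum(axis=0)                       # Pr(x)
x0_of_x = (joint.T @ x0_vals) / p_x[:, None]  # x0(x) = sum over x0 of Pr(x0|x) x0

def sqdist(a, b):
    """Matrix of squared Euclidean distances between rows of a and rows of b."""
    return np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)

# Objective before integrating out x0 (the form preceding Equation 5).
d_before = 2.0 * np.einsum("kj,jm,km->", joint, enc, sqdist(x0_vals, rec))
# First term of Equation 5.
d_after = 2.0 * np.einsum("j,jm,jm->", p_x, enc, sqdist(x0_of_x, rec))
# The constant: twice the mean squared deviation of x0 about x0(x).
constant = 2.0 * np.sum(joint * sqdist(x0_vals, x0_of_x))
print(d_before, d_after + constant)
```

The constant involves neither Pr(y|x) nor the reconstruction vectors, so it can be ignored during optimisation.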


C. Nuisance Degrees of Freedom 


For convenience, split up the input space into (possibly non-orthogonal) subspaces as (x_0, x_1), where all of the distortion is contained in x_1, which requires that any distortion that lies in the x_0 subspace is regarded as part of the undistorted input. The directed graph becomes (x_0, 0) → (x_0, x_1) → y → x_0' as shown in Figure 3.


The expression for D becomes (compare Equation 4)

$$D = 2 \int dx_0 \Pr(x_0) \int dx_1 \Pr(x_1|x_0) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x_0, x_1) \, \|x_0 - x_0'(y)\|^2 \tag{6}$$


Figure 3: A folded Markov chain (FMC) in which an input vector (x_0, 0) is first distorted into (x_0, x_1), which is then encoded as a code index vector y that is drawn from a conditional probability Pr(y|x_0, x_1), which is then decoded as a reconstruction vector x_0' drawn from the Bayes' inverse conditional probability Pr(x_0'|y).


Consider the related optimisation problem in which an attempt to reconstruct (x_0, x_1) is made, as shown in Figure 4.


Figure 4: Modified version of Figure 3 in which the reconstruction link is switched from the original undistorted signal to the full signal+distortion.


The corresponding objective function may be obtained by modifying Equation 6, where the cross-term arising from non-orthogonal (x_0, x_1) is omitted.

$$D = 2 \int dx_0 \Pr(x_0) \int dx_1 \Pr(x_1|x_0) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x_0, x_1) \left( \|x_0 - x_0'(y)\|^2 + \|x_1 - x_1'(y)\|^2 \right) \tag{7}$$


Assume for now (to be justified below) that some of the links in Figure 4 are broken as shown in Figure 5.


Figure 5: Modified version of Figure 4 in which the encoder (and reconstruction) links from (and to) the distortion subspace are deleted (as indicated by the dashed lines).


Because the distortion subspace is not involved in the computations in Figure 5, it may be redrawn as shown in Figure 6.


Figure 6: Equivalent version of Figure 5 in which the reconstruction link is moved to an equivalent position.


This is the same as Figure 3, except that the encoder now disregards (or is invariant with respect to) the nuisance degrees of freedom.

In order to break the links as shown in Figure 5 the following argument is required:


1. Assume that the encoder is independent of x_1, so that Pr(y|x_0, x_1) = Pr(y|x_0).


2. The ||x_1 − x_1'(y)||^2 term in D needs to simplify to a constant.


3. This requires that

$$\int dx_0 \Pr(x_0) \int dx_1 \Pr(x_1|x_0) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x_0) \, \|x_1 - x_1'(y)\|^2 = \text{constant}.$$


4. To guarantee this constant, it is sufficient to have ∫ dx_1 Pr(x_1|x_0) ||x_1 − x_1'(y)||^2 = constant independent of x_0 and y.


5. To guarantee this constant independent of x_0 and y, it is sufficient to have Pr(x_1|x_0) = Pr(x_1).


6. Given that Pr(x_1|x_0) = Pr(x_1) and Pr(y|x_0, x_1) = Pr(y|x_0), then making the replacement x_1'(y) → ⟨x_1⟩ in D will give an objective function with the same stationary points as D, because x_1'(y) = ⟨x_1⟩ is the stationary point of D with respect to x_1'(y).


7. Given that Pr(x_1|x_0) = Pr(x_1) and x_1'(y) = ⟨x_1⟩, the result ∫ dx_1 Pr(x_1|x_0) ||x_1 − x_1'(y)||^2 = constant independent of x_0 and y follows automatically.


The assumptions may be summarised as

$$\Pr(y|x_0, x_1) = \Pr(y|x_0), \qquad \Pr(x_1|x_0) = \Pr(x_1) \tag{8}$$

which allow the objective function D (see Equation 7) to be replaced by the equivalent objective function

$$D = 2 \int dx_0 \Pr(x_0) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x_0) \, \|x_0 - x_0'(y)\|^2 + \text{constant} \tag{9}$$

This is the standard FMC objective function (compare Equation 2) for encoding and reconstructing the undistorted input, for which the directed graph is x_0 → y → x_0'. Note that, under the stated assumptions, the simplification in Equation 9 occurs even if the two subspaces are not orthogonal to each other, because the potential cross-term ∫ dx_1 Pr(x_1|x_0) (x_0 − x_0'(y)).(x_1 − x_1'(y)) in Equation 7 is zero.

In summary, the encoder has access only to the signal + distortion (x_0, x_1) (see Figure 4 and Equation 7), but the assumptions in Equation 8 force the encoder to disregard the distortion (see Figure 6 and Equation 9). In practice, it is not possible to satisfy these assumptions in general, because it is not known in advance how to extract orthogonal signal and distortion subspaces (x_0, x_1) given examples of only the distorted signal. However, these assumptions may be encouraged to hold true by minimising D (as defined in Equation 7) under certain constraints, in which case Figure 6 and Equation 9 follow automatically from Figure 4 and Equation 7, respectively. These constraints are discussed in Section IID.

This type of encoder, in which the large degrees of freedom are preferentially encoded, can be used as the basis of a so-called “residual vector quantiser” [1], in which (quoting from [1]) “the quantiser has a sequence of encoding stages, where each stage encodes the residual (error) vector of the prior stage”. Note that a residual vector quantiser is a special case of the type of multistage encoder discussed in [9].


D. Optimisation Constraints


Henceforth, only the scalar case will be considered, so the vector y is now replaced by the scalar y (1 ≤ y ≤ M). In order to implement a practical optimisation procedure for minimising D it is necessary to introduce a variety of assumptions and constraints.
Because Pr(y|x) is a probability it satisfies Pr(y|x) ≥ 0 and Σ_{y=1}^{M} Pr(y|x) = 1, which is guaranteed if Pr(y|x) is written as

$$\Pr(y|x) = \frac{Q(y|x)}{\sum_{y'=1}^{M} Q(y'|x)} \tag{10}$$

where Q(y|x) ≥ 0. This removes the need to explicitly impose the constraint Σ_{y=1}^{M} Pr(y|x) = 1 during optimisation. The Q(y|x) are the unnormalised likelihoods of sampling code index y from the code book.

However, Q(y|x) itself needs to be described by a finite number of parameters in order that the values that minimise D may be derived from a finite amount of training data. It can be shown that the optimal form of Pr(y|x) is piecewise linear in x [9], and that for training data that lie on smooth curved manifolds the form of this solution is well approximated by a piecewise linear Q(y|x) of the form [10]

$$Q(y|x) = \begin{cases} w(y) \cdot x - a & w(y) \cdot x \ge a \\ 0 & w(y) \cdot x \le a \end{cases} \tag{11}$$

which is the same as the functional form used for the neural response in [11]. However, the precise functional form of Q(y|x) needs to exhibit this behaviour only in the vicinity of the data manifold, so in particular it can be allowed to saturate (i.e. Q(y|x) → 1) as w(y).x → ∞. A convenient functional form that achieves this is the sigmoid, which is defined as

$$Q(y|x) = \frac{1}{1 + \exp(-w(y) \cdot x - b(y))} \tag{12}$$

This reduces the problem of minimising D to one of finding the optimal values of the w(y), b(y) and x'(y). This may be done by using the gradient descent procedure described in [7].
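
The following is a minimal sketch of how the sigmoid likelihoods of Equation 12 could be normalised into Pr(y|x) as in Equation 10. The function names and the example weights, biases and input are hypothetical, chosen only for illustration; the gradient descent procedure of [7] is not reproduced.

```python
import math


def sigmoid_q(w, b, x):
    """Unnormalised likelihood Q(y|x) = 1 / (1 + exp(-w.x - b)) for one code index."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-activation))


def posterior(weights, biases, x):
    """Pr(y|x) = Q(y|x) / sum over y' of Q(y'|x), for y = 1..M."""
    q = [sigmoid_q(w, b, x) for w, b in zip(weights, biases)]
    total = sum(q)
    return [qy / total for qy in q]


# Hypothetical parameters: M = 3 code indices, 2-dimensional input.
weights = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
biases = [0.0, -0.5, 0.25]
probs = posterior(weights, biases, (0.3, 0.8))
```

Because each Q(y|x) is strictly positive, the normalised values are automatically positive and sum to one, so no explicit constraint is needed.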

If the input is an undistorted signal (i.e. x = (x_0, 0)) which lies on a smooth curved manifold, then the sigmoids can cooperate in encoding this input as illustrated in Figure 7, where the sigmoid threshold planes w(y).x + b(y) = 0 are shown slicing pieces off the curved manifold [10].


Figure 7: Illustration of how a number of sigmoids can coop- 
erate to slice pieces off a signal manifold. 


The additional constraints that are required in order to implement the behaviour described in Section IIC will now be described. Thus the constraints must be such that the encoder disregards (or is invariant with respect to) the nuisance degrees of freedom x_1 in the full input vector (x_0, x_1). However, without knowing Pr(x_0, x_1) in advance (which would allow x_0(x) in Equation 5 to be calculated), it is not possible to give a general approach that works in all cases. At best, an empirical approach must be used.

A very simple and useful constraint is to impose a threshold constraint on the sigmoid function, which forces the value of the sigmoid to lie exactly halfway up its slope when the norm of its input vector is θ. This is achieved by choosing b(y) = −θ ||w(y)||, so that

$$Q(y|x) = \frac{1}{1 + \exp\left(-\|w(y)\| \left(\hat{w}(y) \cdot x - \theta\right)\right)} \tag{13}$$

where ||w(y)|| = √(w(y).w(y)) and ŵ(y) = w(y)/||w(y)||.
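
As a quick check of the threshold constraint, the sketch below evaluates the sigmoid with b(y) = −θ ||w(y)|| and confirms that it takes the value 1/2 on the plane ŵ(y).x = θ. The weight vector, threshold and test points are hypothetical values chosen for illustration.

```python
import math


def thresholded_sigmoid(w, theta, x):
    """Q(y|x) with b(y) = -theta * ||w(y)||, i.e. 1 / (1 + exp(-||w|| (w_hat.x - theta)))."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    w_hat_dot_x = sum(wi * xi for wi, xi in zip(w, x)) / norm_w
    return 1.0 / (1.0 + math.exp(-norm_w * (w_hat_dot_x - theta)))


w = (3.0, 4.0)            # hypothetical weight vector with ||w|| = 5
theta = 2.0               # hypothetical threshold
x_on_plane = (1.2, 1.6)   # theta * w_hat, so that w_hat.x = theta
q_half = thresholded_sigmoid(w, theta, x_on_plane)
q_inside = thresholded_sigmoid(w, theta, (0.0, 0.0))   # below the threshold plane
q_outside = thresholded_sigmoid(w, theta, (3.0, 4.0))  # beyond the threshold plane
```

With this parameterisation θ sets where the threshold plane sits along ŵ(y), while ||w(y)|| only controls how steeply the sigmoid rises through it.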

If the input is a distorted signal (i.e. x = (x_0, x_1)) which lies on a “thickened” version of the smooth curved manifold of Figure 7 (the thickness represents the nuisance degrees of freedom), then the sigmoids can cooperate in encoding this input as illustrated in Figure 8, where the sigmoid threshold planes ŵ(y).x = θ are shown slicing pieces off the curved manifold in a way that disregards the nuisance degrees of freedom.

Note that in Figure 8 the representation of thickening is not complete, because it can actually occur in any direction orthogonal to the manifold, including directions orthogonal to the space in which the manifold is embedded; the radial direction in Figure 8 does not include this latter possibility.


Figure 8: Illustration of how a number of sigmoids can cooperate to slice pieces off a signal manifold thickened by nuisance degrees of freedom.

In practice, for numerical efficiency and to encourage the optimisation procedure to locate the global minimum of D, it is useful to introduce two additional constraints. Firstly, because optimal solutions typically satisfy x'(y) ∝ w(y) up to a multiplicative constant, each reconstruction vector x'(y) can be forced to lie parallel to the corresponding weight vector w(y), so that x'(y) ∝ w(y); this constraint was also used in [12], but there it was a necessary part of the optimisation procedure, whereas here it merely encourages faster convergence. Secondly, the norm of the weight vectors ||w(y)|| can be constrained as ||w(y)|| = w_0, in order to avoid situations where they grow to rather large values which make Q(y|x) (and hence Pr(y|x)) depend very strongly on x in some regions. Both of these constraints speed up convergence to the global minimum of D, and can finally be lifted in the vicinity of an optimal solution to obtain complete convergence.
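
One simple way to impose these two constraints, for example after each gradient descent step, is sketched below. The function name, the scalar `scale` that sets the length of x'(y), and the example values are assumptions made for illustration; the text does not specify how the constraints are enforced.

```python
import math


def apply_constraints(w, w0, scale):
    """Rescale w(y) to norm w0 and place x'(y) parallel to it.

    `scale` is a hypothetical free parameter giving x'(y) = scale * w(y).
    """
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    w_constrained = tuple(w0 * wi / norm_w for wi in w)
    x_reconstruction = tuple(scale * wi for wi in w_constrained)
    return w_constrained, x_reconstruction


w_c, x_rec = apply_constraints((3.0, 4.0), w0=2.0, scale=0.5)
```

Lifting the constraints near an optimum then amounts to skipping this step and letting w(y) and x'(y) vary independently.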


E. Jammer Nulling 


A number of examples of typical behaviours of Pr(y|x) are shown in Figure 9.


Figure 9: Examples of the response of Pr(y|x) to signal and jammer subspaces.


In Figure 9 the signal and jammer degrees of freedom generate a pair of non-orthogonal subspaces, whose axes are indicated in bold. The response contours of a variety of possible Pr(y|x) are shown. In the “full” case a pair of Pr(y|x) respond to the signal and jammer subspaces respectively. In the “signal” case a Pr(y|x) responds to only the signal subspace, and is thus invariant over the jammer subspace. In the “jammer” case the situation is the reverse of the “signal” case. This argument may readily be generalised to any number of Pr(y|x).

If it is assumed that the jammer is the “large” degree of freedom and the signal is the “small” degree of freedom, the signal and jammer subspaces may be separated by adjusting the threshold parameter θ so that in Figure 9 the “jammer” case is obtained, in which case the Pr(y|x) for y = 1, 2, ..., M will all become invariant over the signal subspace. The jammer subspace is then spanned by the set of gradient vectors ∇Pr(y|x) for y = 1, 2, ..., M, which can thus be used to construct a projection operator J onto the jammer subspace, and a projection operator 1 − J onto the signal subspace. This definition of the projection operator may also be used in cases where the jammer and signal subspaces are curved, so that the directions of their axes are functions of x, and all of the straight lines in Figure 9 are replaced by curves defining a curvilinear coordinate system and its coordinate surfaces. Note that curved subspaces are the norm rather than the exception.
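
The construction of J from a set of gradient vectors can be sketched as follows. This assumes an orthogonal projector built by Gram-Schmidt orthonormalisation, which is one possible choice and is not stated in the text; the gradient vectors below are hypothetical stand-ins for ∇Pr(y|x) evaluated at some x.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt orthonormalisation; (near-)dependent vectors are dropped."""
    basis = []
    for v in vectors:
        r = list(v)
        for e in basis:
            c = dot(r, e)
            r = [ri - c * ei for ri, ei in zip(r, e)]
        n = math.sqrt(dot(r, r))
        if n > tol:
            basis.append([ri / n for ri in r])
    return basis


def project(basis, x):
    """Apply J: project x onto the subspace spanned by the orthonormal basis."""
    out = [0.0] * len(x)
    for e in basis:
        c = dot(x, e)
        out = [oi + c * ei for oi, ei in zip(out, e)]
    return out


def null_jammer(basis, x):
    """Apply 1 - J: remove the component of x lying in the jammer subspace."""
    return [xi - ji for xi, ji in zip(x, project(basis, x))]


# Hypothetical gradient vectors spanning the first two coordinate directions.
gradients = [(1.0, 1.0, 0.0), (2.0, 2.0, 0.0), (1.0, 0.0, 0.0)]
basis = orthonormal_basis(gradients)
residual = null_jammer(basis, (3.0, -2.0, 5.0))
```

For curved subspaces the gradients depend on x, so under this sketch the basis would have to be recomputed for each input vector.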


III. JAMMER NULLING SIMULATIONS


The optimisation of the encoder may be done by min- 
imising D using gradient descent [7], using the sigmoid 
function in Equation 13 to constrain the optimisation so 
that it encodes only the jammer subspace. 

In these simulations the input vector x is 100-dimensional so that x = (x_1, x_2, ..., x_100), and each vector in the training set is independently generated as a superposition of a pair of response functions

$$x_i = a_s \, \frac{\sin\left(\frac{i - i_s}{\sigma}\right)}{\frac{i - i_s}{\sigma}} + a_j \, \frac{\sin\left(\frac{i - i_j}{\sigma}\right)}{\frac{i - i_j}{\sigma}}$$

where a_s is the signal amplitude that is uniformly distributed in the interval [−√(10^−3), √(10^−3)] (this corresponds to a signal level of −30 dB), a_j is the jammer amplitude that is uniformly distributed in the interval [−1, 1] (this corresponds to a jammer level of 0 dB), i_s is the signal location that is chosen to be 50, i_j is the jammer location that is uniformly distributed in the interval [38 − Δ, 38 + Δ] (Δ = 0, 2, 4 is used in the simulations), and σ is the width of the response function that is chosen to be 2. The peak and the first zero of the sinc function are separated by πσ, which defines the resolution cell size. The mean jammer position and the signal position satisfy i_s − ⟨i_j⟩ = 12, which corresponds to a separation of 12/(πσ) ≈ 2 resolution cells. Random noise uniformly distributed in the interval [−√(10^−5), √(10^−5)] (this corresponds to a noise level of −50 dB) is also added to each component of the training vector.
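
A training vector of this kind could be generated as in the sketch below. It assumes the response function is sin(z)/z with z = (i − location)/σ, which is consistent with the stated peak-to-first-zero separation of πσ; the function names, argument names and the use of Python's `random` module are illustrative choices, not part of the original simulations.

```python
import math
import random


def sinc(z):
    """Unnormalised sinc function sin(z)/z, with the limiting value 1 at z = 0."""
    return 1.0 if z == 0.0 else math.sin(z) / z


def amplitude_from_db(level_db):
    """Peak amplitude whose square corresponds to the given power level in dB."""
    return math.sqrt(10.0 ** (level_db / 10.0))


def training_vector(rng, delta, n=100, i_s=50, i_j_mean=38, sigma=2.0,
                    signal_db=-30.0, jammer_db=0.0, noise_db=-50.0):
    """Signal + jammer + noise vector with components x_1 .. x_n."""
    a_s = rng.uniform(-amplitude_from_db(signal_db), amplitude_from_db(signal_db))
    a_j = rng.uniform(-amplitude_from_db(jammer_db), amplitude_from_db(jammer_db))
    i_j = rng.uniform(i_j_mean - delta, i_j_mean + delta)
    noise = amplitude_from_db(noise_db)
    return [a_s * sinc((i - i_s) / sigma)
            + a_j * sinc((i - i_j) / sigma)
            + rng.uniform(-noise, noise)
            for i in range(1, n + 1)]


rng = random.Random(0)          # fixed seed so the example is repeatable
x = training_vector(rng, delta=2)
cells = 12 / (math.pi * 2.0)    # signal/jammer separation in resolution cells
```

Setting `delta` to 0, 2 or 4 reproduces the three jammer scenarios described in the text.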


Figure 10: A two-dimensional projection of the curved manifold generated by the jammer when Δ = 2.


In Figure 10 the 2-dimensional manifold generated by varying the jammer position over the interval [38 − Δ, 38 + Δ] (for Δ = 2), and varying the jammer amplitude over the interval [−1, 1], is shown. Because the input vector x is 100-dimensional, only a low-dimensional projection can be visualised, and the 2-dimensional vector (x_49, x_51) is displayed here. The curvilinear grid traces out the coordinate surfaces of jammer position i_j and jammer amplitude a_j, and the whole diagram shows how this grid is embedded in (x_49, x_51)-space. Note that the a_j dimension behaves as a “radial” coordinate (straight lines), whereas the i_j dimension behaves as an “angular” coordinate (curved lines).

In Figure 11 an encoder is trained on three different jammer scenarios Δ = 0, 2, 4. After training the encoder is tested for how well it can be used to null a pure jammer (i.e. with no signal or noise added), where the degree of nulling is defined as the ratio of the squared lengths of the nulled input vector and the original input vector. This is a good test of the ability of the encoder to simultaneously learn the profile of the jammer and the shape of the jammer manifold which is generated by sweeping this profile over the interval [38 − Δ, 38 + Δ]. When Δ = 0 there is a sharp minimum at the jammer location i_j = 38, as expected. When Δ = 2 the minimum becomes spread over the jammer locations i_j ∈ [36, 40], and when Δ = 4 the minimum becomes spread even more broadly over the jammer locations i_j ∈ [34, 42]. All of these results are as expected.
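
The degree of nulling defined above is straightforward to compute; a minimal sketch follows. Expressing the ratio in dB is an assumption made here for convenience, and the example vectors are hypothetical.

```python
import math


def degree_of_nulling(original, nulled):
    """Ratio of the squared length of the nulled vector to that of the original."""
    return sum(v * v for v in nulled) / sum(v * v for v in original)


def degree_of_nulling_db(original, nulled):
    """The same ratio in decibels (undefined if the nulled vector is exactly zero)."""
    return 10.0 * math.log10(degree_of_nulling(original, nulled))


# Hypothetical example: nulling reduces every component by a factor of 10.
ratio = degree_of_nulling((3.0, 4.0), (0.3, 0.4))
ratio_db = degree_of_nulling_db((3.0, 4.0), (0.3, 0.4))
```

A smaller ratio means deeper nulling, so a pure jammer that lies entirely within the learnt jammer subspace would give a ratio close to zero.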

In Figure 12 typical examples of an input vector to- 
gether with how it appears after jammer nulling are 
shown for each of the jammer scenarios considered in 
Figure 11. In every case the signal is clearly revealed at 
its correct location after nulling the jammer. 

In all of these training scenarios, one could envisage further constraining some of the properties of the encoder, in order to introduce prior knowledge of the form of the jammer and/or signal subspaces, and to thereby reduce the computational complexity of the jammer nulling. For instance, the signal subspace could be predefined, as in conventional algorithms which hold constant the response in a predefined “look direction”. Similarly, the jammer subspace could be built out of predefined subspaces which are optimised so as to maximally null the jammer(s), as in conventional algorithms in which a number of jammer “templates” are used to remove the jammer(s). In general, by choosing appropriate additional constraints, the SVQ approach to jammer nulling can be made backwardly compatible with conventional approaches.


Figure 11: Plot of degree of nulling against nominal jammer location, for jammer locations that are spread over the intervals [38, 38] using M = 2, [36, 40] using M = 4, and [34, 42] using M = 6.


Figure 12: Plot of a typical input vector before and after jammer nulling for each of the scenarios in Figure 11.


IV. CONCLUSIONS 


The theory of stochastic vector quantisers (SVQ) [7] 
has been extended to allow the quantiser to develop in- 
variances, so that only “large” degrees of freedom in the 
input vector are represented in the code. This has been


Using Self-Organising Mappings to Learn the Structure of Data Manifolds *


In this paper it is shown how to map a data manifold into a simpler form by progressively discarding small degrees of freedom. This is the key to self-organising data fusion, where the raw data is embedded in a very high-dimensional space (e.g. the pixel values of one or more images), and the requirement is to isolate the important degrees of freedom which lie on a low-dimensional manifold. A useful advantage of the approach used in this paper is that the computations are arranged as a feed-forward processing chain, where all the details of the processing in each stage of the chain are learnt by self-organisation. This approach is demonstrated using hierarchically correlated data, which causes the processing chain to split the data into separate processing channels, and then to progressively merge these channels wherever they are correlated with each other. This is the key to self-organising data fusion.


I. INTRODUCTION 


The aim of this paper is to illustrate an approach that 
maps raw data into a representation that reveals its inter- 
nal structure. The raw data is a high-dimensional vector 
of sample values output by a sensor such as the sam- 
ples of a time series or the pixel values of an image, and 
the representation is typically a lower dimensional vector 
that retains some or all of the information content of the 
raw data. There are many ways of achieving this type 
of data reduction, and this paper will focus on methods 
that learn from examples of the raw data alone. 


A key approach to data reduction is the self-organising 
map (SOM) [2]. There are many variants of the SOM ap- 
proach which may be used to map raw data into a lower 
dimensional space that retains some or all of its information content. In order to increase the variety of mappings that SOMs can learn, some of these variants use quite sophisticated learning algorithms. For instance, the topology
of a SOM can be learnt by the neural gas approach [9], 
or the topology of the network connecting several SOMs 
can be learnt by the growing hierarchical self-organising 
map (GHSOM) approach [1]. 


The approach used in this paper aims to achieve a 
similar type of result to the GHSOM approach. GH- 
SOM is a top-down coarse-to-fine approach to optimising 
a tree structured network of SOMs, whereas in this pa- 
per a bottom-up fine-to-coarse approach will be used that 
learns a tree structure where appropriate. The choice of a 
fine-to-coarse rather than coarse-to-fine approach is made 
in order to obtain networks that can be readily applied 
to data fusion problems, where the goal is to progressively discard noise (and irrelevant degrees of freedom) as the data passes along the processing chain, thus gradually reducing its dimensionality to eventually obtain a low-dimensional representation of the original raw data.

The basis for the approach used in this paper is a 
Bayesian theory of SOMs [4] in which a SOM is mod- 
elled as an encoder/decoder pair, where the decoder is 
the Bayes inverse of the encoder. In this approach the 
encoder is modelled as a conditional probability over all 
possible codes given the input, and the code that is actu- 
ally used is a single sample drawn from this conditional 
probability (i.e. a winner-take-all code). When the con- 
ditional probability is optimised to minimise the average 
Euclidean distortion between the original input and its 
reconstruction this leads to a network that has properties 
very similar to a Kohonen SOM. 

The basic approach [4] needs to be extended in two sep- 
arate ways [8]. Firstly, to encourage the self-organisation 
of a processing chain leading from raw data to a higher 
level representation, the single encoder/decoder is ex- 
tended to become a Markov chain of connected encoders, 
where each encoder feeds its output into the next encoder 
in the chain. Secondly, to encourage the self-organisation 
of each encoder into a number of separate smaller en- 
coders and thus to learn tree-structured networks where 
appropriate, each encoder is generalised to use codes that 
make simultaneous use of several samples from the condi- 
tional probability rather than only a single winner-take- 
all sample. 

The goal of the approach used in this paper is similar to 
that of the multiple cause vector quantisation approach 
[10], because the common aim is to split data into its 
separate components (or causes). However, the approach 
used in this paper aims to minimise the amount of man- 
ual intervention in the training of the network, and thus 
allow the structure of the data to determine the struc- 
ture of the network. This is made possible by using codes 
that consist of several samples from a single conditional 
probability, which allows each encoder to decide for itself 
how to split into a number of separate smaller encoders. 
Also the approach used in this paper does not make ex- 
plicit use of a generative model of the data, because the 
aim is only to map raw data into a representation that 
clarifies its internal structure (i.e. build a recognition 
model), for which a generative model may be sufficient but is actually not necessary.

This paper is organised as follows. In Section II the 
structure of data is represented as smooth curved mani- 
folds, and encoders are represented as hyperplanes that 
slice through these manifolds. In Section III the theory 
of a single encoder is developed by extending a Bayesian 
theory of SOMs [4] from winner-take-all encoders to mul- 
tiple output encoders, and this theory is further extended 
to Markov chains of connected encoders [8]. In Section 
IV these results are used to train a network on some 
hierarchically correlated data to demonstrate the self- 
organisation of a tree-structured network for processing 
the data. 


II. DATA MANIFOLDS 


In order to represent the structure of data a flexible 
framework needs to be used. In this paper an approach 
will be used in which the structure of the data manifold 
is of primary importance, and the aim is to split apart 
the manifold in such a way as to reveal how its overall 
structure is composed. This approach must take account 
of the relative amplitude of the various contributions, 
so that a high resolution representation would include 
even the smallest amplitude contributions to the man- 
ifold, and a low resolution representation would retain 
only the largest amplitude contributions. More generally, 
it would be useful to construct a sequence of representa- 
tions, each with a lower resolution than the previous one 
in the sequence. This could be achieved by progressively 
discarding the smallest degree of freedom to gradually 
lower the resolution of the representation. In effect, the 
representation will become increasingly abstract as it be- 
comes more and more invariant to the fine details of the 
original data manifold. 

In Section IIA the basic notation used to describe manifolds is presented, and in Section IIB the process of splitting a manifold into its component pieces and then reassembling these to form an approximation to the manifold is described.


A. Representation of Data Manifolds 


Assume that the raw data vector x lies on a smooth manifold x(u), parameterised by u which is a vector of co-ordinates in the manifold. Usually, though not invariably, x is a high-dimensional vector (e.g. an image comprising an array of pixel values) and u is a low-dimensional vector (e.g. a vector of object positions), in which case the space in which x lives is a high-dimensional embedding space for a low-dimensional manifold. Typically, u represents the underlying degrees of freedom (e.g. object co-ordinates), whereas x represents the observed degrees of freedom (e.g. sensor measurements). Usually, u will contain some noise degrees of freedom, but these can be handled in exactly the same way as other degrees of freedom by splitting u as u = (u_s, u_n) where u_s is signal and u_n is noise. The probability density function (PDF) Pr(u) describes how the manifold is populated and Pr(x) (where Pr(x) = ∫ du Pr(u) δ(x − x(u))) describes how the embedding space is populated.

In general, x(u) is a non-linear function of u so the manifold is curved, and thus occupies more linear dimensions of the embedding space than would be the case if the manifold were not curved. It is commonplace for a 1-dimensional manifold (i.e. u is a scalar) to be curved so as to occupy all of the linear dimensions of the embedding space (e.g. the manifold of images generated by moving an object along a 1-dimensional line of positions).


Figure 1: Examples of manifolds generated by images of a pair of objects. In each of the two diagrams the upper half shows the sensor data, and the lower half shows the low-dimensional manifold topology assuming that the sensor data have circular wraparound. The high-dimensional manifold geometry has a number of dimensions equal to the number of pixels in the corresponding sensor. (a) Tensor product of manifolds: 2-torus topology. This is generated by observing each object using a separate sensor. (b) Superposition (or mixture) of manifolds: This is generated by observing both objects using the same sensor, so that there is the possibility of overlap (possibly with obscuration) of the sensor data from the two objects. In the limit where the objects overlap infrequently case (b) closely approximates case (a).


If x(u) can be written as x(u) = (x_1(u_1), x_2(u_2)), where x_1(u_1) and x_2(u_2) are independently parameterised manifolds living in separate subspaces of the embedding space (where dim x = dim x_1 + dim x_2 and dim u = dim u_1 + dim u_2), then x(u) describes a tensor product of manifolds as shown in Figure 1a. This type of manifold arises when the underlying degrees of freedom are measured by separate sensors. This parameterisation can readily be generalised to x(u) = (x_1(u_1), x_2(u_2), ..., x_k(u_k)) for k > 2. For independently populated manifolds Pr(u) factorises as Pr(u) = Pr(u_1) Pr(u_2), and if x(u) = (x_1(u_1), x_2(u_2)) then Pr(x) = Pr(x_1) Pr(x_2) where Pr(x_i) = ∫ du_i Pr(u_i) δ(x_i − x_i(u_i)) for i = 1, 2. For manifolds that are populated in a correlated way (i.e. Pr(u) ≠ Pr(u_1) Pr(u_2)) no such simple result holds.

If x(u) can be written as x(u) = x_1(u_1) + x_2(u_2) (where dim x = dim x_1 = dim x_2), then x(u) describes a superposition (or mixture) of manifolds as shown in Figure 1b. This type of manifold arises when the underlying degrees of freedom are simultaneously measured by the same sensor. If there is little or no overlap between x_1(u_1) and x_2(u_2) then this is approximately equivalent to the case x(u) = (x_1(u_1), x_2(u_2)) (where dim x = dim x_1 + dim x_2 and dim u = dim u_1 + dim u_2). On the other hand, where there is a significant amount of overlap so that x_1(u_1).x_2(u_2) > 0, there is no such correspondence. Assuming Pr(u) = Pr(u_1) Pr(u_2) then Pr(x) is given by Pr(x) = ∫ du_1 du_2 Pr(u_1) Pr(u_2) δ(x − x_1(u_1) − x_2(u_2)).

More generally, x(u) can be written as x(u) = x(u_1, u_2) where x(u_1, u_2) has no special dependence on u_1 and u_2. Although the manifolds are independently parameterised by u_1 and u_2, when they are mapped to x their tensor product structure is disguised by the mapping function x(u_1, u_2) which is usually not invertible. The superposition of manifolds x(u) = x_1(u_1) + x_2(u_2) is a special case of this effect.
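
The difference between the tensor product and superposition cases can be made concrete with a small sketch. The Gaussian object profile, the 16-pixel sensor with circular wraparound, and all function names are hypothetical choices made for illustration.

```python
import math


def object_profile(position, n_pixels=16, width=1.0):
    """Sensor data x_i(u_i) for one object: a Gaussian bump with circular wraparound."""
    profile = []
    for i in range(n_pixels):
        d = abs(i - position) % n_pixels
        d = min(d, n_pixels - d)          # shortest distance around the circle
        profile.append(math.exp(-d * d / (2.0 * width * width)))
    return profile


def tensor_product(u1, u2):
    """x(u) = (x_1(u_1), x_2(u_2)): each object seen by its own sensor."""
    return object_profile(u1) + object_profile(u2)   # list concatenation


def superposition(u1, u2):
    """x(u) = x_1(u_1) + x_2(u_2): both objects seen by the same sensor."""
    return [a + b for a, b in zip(object_profile(u1), object_profile(u2))]


def overlap(u1, u2):
    """Inner product x_1(u_1).x_2(u_2), which measures how much the objects overlap."""
    return sum(a * b for a, b in zip(object_profile(u1), object_profile(u2)))


far_apart = overlap(2, 10)   # objects on opposite sides of the sensor
adjacent = overlap(2, 3)     # objects on neighbouring pixels
```

When the overlap is negligible the two objects occupy effectively separate pixels, which is why the superposition then behaves approximately like a tensor product.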


B. Mapping of Data Manifolds 


Given examples of the raw data x how can an approximation to the mapping function x(u) be constructed? The detailed approach will be described in Section III, but the basic geometric ideas will be described here. The
basic idea is to cut the manifold into pieces whilst re- 
taining only a limited amount of information about each 
piece, and then to reassemble these pieces to reconstruct 
an approximation to the manifold. This process is im- 
perfect because it is disrupted by discarding some of the 
information about each piece, so the reconstructed man- 
ifold is not a perfect copy of the original manifold. This 
loss of information is critical to the success of this pro- 
cess, because if perfect information were retained then 
there would be no need to discover a clever way of cut- 
ting the manifold into pieces, and thus no possibility of 
discovering the structure of the manifold (e.g. whether 
it is a simple tensor product). The information that is 
preserved depends on exactly how the curved manifold 
is mapped to a new representation (see Section III for 
details). 

Figure 2a shows an example of how a convex 1- 
dimensional manifold can be cut into overlapping pieces 
by a set of lines, and Figure 2b shows the generalisation 
to the 2-dimensional case. This process is considerably 
simplified if the manifold is convex because then the hy- 
perplane slices off a localised piece of the manifold, as 
required. 

Figure 2: Using hyperplanes to slice pieces off convex curved manifolds. Slicing a curved manifold into pieces prepares it for mapping to another representation. (a) 1-dimensional manifold with arcs being sliced off by chords. (b) 2-dimensional manifold with caps being sliced off by planes (only a few of these are shown in order to keep the diagram simple).


Figure 3 shows an example of how the convexity assumption can break down. The full embedding space contains the vector formed from an array of samples of a 1-dimensional object function, but only three dimensions of the full embedding space are shown in Figure 3. Several scenarios are shown ranging from a narrow object (i.e. undersampled) to a broad object (i.e. oversampled). Oversampling leads to a smooth convex manifold, whereas undersampling leads to a concave manifold with cusps. Typically, convex manifolds occur in signal and image processing where the raw data are sampled at a high enough rate, and non-convex manifolds occur when the raw data has already been processed into a low-dimensional form, such as when some underlying degrees of freedom (or features) have already been extracted from the raw data.
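
The effect of sampling rate can be explored numerically with the sketch below, which assumes the Gaussian object function of Figure 3 sampled at the three points used for its embedding. One simple symptom of undersampling, used here as a rough diagnostic rather than a test of convexity, is that the manifold dips towards the origin when the object sits between sample points. The function names and the number of steps are illustrative assumptions.

```python
import math


def sampled_object(a, sigma, sample_points=(1, 2, 3)):
    """Embedding (x_1, x_2, x_3) with x_i = exp(-(i - a)^2 / (2 sigma^2))."""
    return [math.exp(-(i - a) ** 2 / (2.0 * sigma ** 2)) for i in sample_points]


def norm_ratio(sigma, steps=200):
    """Smallest over largest ||x|| as the object position a sweeps from 1 to 3."""
    norms = []
    for k in range(steps + 1):
        a = 1.0 + 2.0 * k / steps
        x = sampled_object(a, sigma)
        norms.append(math.sqrt(sum(v * v for v in x)))
    return min(norms) / max(norms)


narrow = norm_ratio(0.25)   # undersampled object
wide = norm_ratio(1.0)      # oversampled object
```

For the narrow object the ratio is small because the vector collapses towards the origin between sample points, whereas for the wide object the length of the vector stays nearly constant along the manifold.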


Figure 4 shows the results obtained for a circular manifold using the stochastic vector quantiser (SVQ) approach of Section III. The results correspond to Figure
2a, except that now the slicing is done softly in order to 
preserve additional information about the manifold, and 
to ensure that the reconstructed manifold does not show 
artefacts when the slices are reassembled. 

Figure 5 shows an example of how a convex (1 + ε)-dimensional
manifold (a small length extracted from a cylindrical surface)
can be cut into overlapping pieces by a set of planes. The 1 in
(1 + ε) is a large degree of freedom (arc length around the
cylinder), whereas the ε in (1 + ε) is a small degree of freedom
(length along the cylinder) because it has a small amplitude
compared to
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Figure 3: Manifolds generated by a 1-dimensional object. The 
data vector is $x = (\cdots, x_{-2}, x_{-1}, x_0, x_1, x_2, \cdots)$
where $x_i = \exp\left(-\frac{(i-a)^2}{2\sigma^2}\right)$. $\sigma$ is
the width of the object function, $a$ ($-\infty < a < \infty$) is the
position of the object, and $i$ ($i = 0, \pm 1, \pm 2, \cdots$)
is the location of the points where the object amplitude is
sampled. The manifolds shown are 3-dimensional embeddings
$(x_1, x_2, x_3)$ of the 1-dimensional curved manifolds generated
as $a$ varies for a variety of object widths $\sigma$. For $\sigma = 0.25$
(i.e. a narrow object function) the manifold is concave with
cusps, as $\sigma$ is increased the concavity and the cusps become
less pronounced until the manifold crosses the border between
being concave and being convex, and for $\sigma = 1$ (i.e. a wide
object function) the manifold is smoothly convex. Concave
manifolds with cusps are not well suited to being sliced apart
by hyperplanes whereas smoothly convex manifolds are well
suited, and this type of convex manifold is typical of
high-dimensional data which also has a high resolution so that
each object covers several sample points.


the large degree of freedom. Because of the orientation
of the planes they are insensitive to the small degree of
freedom, so the reconstructed manifold is a 1-dimensional
manifold (i.e. the ε component has been discarded). The
orientation of the planes may be used in various ways to
control their sensitivity to the manifold, and some quite
sophisticated examples of this will be discussed in
Section IV. These results generalise to soft slicing as used in
Figure 4.


III. LEARNING A MANIFOLD


In order to learn how to represent the structure of a 
data manifold a flexible framework needs to be used. In 
this paper an approach will be used in which the manifold 
is mapped to a lower resolution representation in such a 
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Figure 4: Using a stochastic vector quantiser (SVQ) to map 
a curved manifold to a new representation (see Section III for 
details). The manifold is the unit circle which is softly sliced 
up by the SVQ posterior probabilities, which are defined as 
the (normalised) outputs of a set of sigmoid functions, which 
in turn depend on a set of weight vectors and biases. For 
each sigmoid function a dashed line is drawn to show where 
its (unnormalised) output is 1/2, although here it is the curved
contours of the posterior probabilities (rather than the dashed 
lines) that are actually used to slice up the manifold. 


Figure 5: Using hyperplanes to slice pieces off a curved man- 
ifold. The manifold is 2-dimensional with a large and a small 
degree of freedom. Each hyperplane slices through the mani- 
fold in such a way that it cuts off a piece of the manifold that 
has a limited range of values of the large degree of freedom 
but all possible values of the small degree of freedom. This 
is the basic means by which a manifold can be mapped to a 
new representation. 


way that a good approximation to the original manifold 
can be reconstructed. Key requirements are that these 
mappings can be cascaded to form sequences of represen- 
tations of progressively lower resolution having more and 
more invariance with respect to details in the original 
manifold, and that the mappings can learn to represent 
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tensor products of manifolds so that the representation of 
the manifold can split into separate channels. To achieve 
this it is sufficient to use a variant [8] of the standard 
vector quantiser [3] to gradually compress the data. 

In Section III A the theory of stochastic vector quantisers
(SVQ) is presented, and in Section III B it is extended
to chains of linked SVQs.


A. Stochastic Vector Quantiser 


As was discussed in Section II a procedure is needed 
for cutting a manifold into pieces and then reassembling 
these pieces to reconstruct the manifold. It turns out 
that all of the required properties emerge automatically 
from vector quantisers (VQ), and their generalisation to 
stochastic vector quantisers (SVQ). 




$$D = \int dx_0\, dx_1\, dx_1'\, dx_0'\; \Pr(x_0)\, \Pr(x_1|x_0)\, \delta(x_1' - x_1)\, \Pr(x_0'|x_1')\, \|x_0 - x_0'\|^2 \qquad (3.1)$$


Using Bayes’ theorem Equation 3.1 can be manipulated 
into the form [4] 


$$D = 2 \int dx_0\, dx_1\; \Pr(x_0)\, \Pr(x_1|x_0)\, \|x_0 - x_0'(x_1)\|^2 \qquad (3.2)$$

where $D$ must be minimised with respect to both the
encoder $\Pr(x_1|x_0)$ and the reconstruction vector $x_0'(x_1)$.
Note that this simplification of Equation 3.1 into
Equation 3.2 depends critically on the Euclidean form of the
objective function.
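The step from Equation 3.1 to Equation 3.2 can be sketched as follows. This is an outline under the assumption that the decoder is the Bayes inverse of the encoder, and it is not the derivation given in [4]; the conditional mean $\mu(x_1)$ is a symbol introduced here purely for illustration.

```latex
\begin{aligned}
D &= \int dx_0\, dx_1\, dx_0'\; \Pr(x_0)\, \Pr(x_1|x_0)\, \Pr(x_0'|x_1)\, \|x_0 - x_0'\|^2
     && \text{(delta function integrated out)} \\
  &= \int dx_1\, \Pr(x_1) \int dx_0\, dx_0'\; \Pr(x_0|x_1)\, \Pr(x_0'|x_1)\, \|x_0 - x_0'\|^2
     && \text{(Bayes' theorem)} \\
  &= 2 \int dx_1\, \Pr(x_1) \left[ \int dx_0\, \Pr(x_0|x_1)\, \|x_0\|^2 - \|\mu(x_1)\|^2 \right],
     && \mu(x_1) \equiv \int dx_0\, \Pr(x_0|x_1)\, x_0 \\
  &= 2 \int dx_0\, dx_1\; \Pr(x_0)\, \Pr(x_1|x_0)\, \|x_0 - \mu(x_1)\|^2 .
\end{aligned}
```

The last line has the form of Equation 3.2 with the reconstruction vector set to the conditional mean $\mu(x_1)$, which is also the value that minimises Equation 3.2 with respect to the reconstruction vector. The expansion in the third line relies on the squared Euclidean norm, which is why the simplification depends on the Euclidean form of the objective function.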

Figure 7 is a transformed version of Figure 6 that reflects
the transformation of Equation 3.1 into Equation
3.2. The encoder $\Pr(x_1|x_0)$ is the most important part of
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Figure 6: A matched encoder/decoder pair represented as 
a folded Markov chain (FMC) $x_0 \to x_1 \to x_1' \to x_0'$.
The input $x_0$ is encoded as $x_1$ which is then passed along
a distortionless communication channel to become $x_1'$ which
is then decoded as $x_0'$. The encoder is modelled using the
conditional probability $\Pr(x_1|x_0)$ to allow for the possibility
that the encoder is stochastic, and the corresponding decoder
is modelled using the Bayes inverse conditional probability
$\Pr(x_0'|x_1')$. The distortionless communication channel is
modelled using the delta function $\delta(x_1' - x_1)$.


Figure 6 shows a folded Markov chain (FMC) as described
in [4]. An FMC encodes its input $x_0$ (i.e. cuts
the input manifold into pieces using the conditional PDF
$\Pr(x_1|x_0)$) and then reconstructs an approximation $x_0'$ to
its input (i.e. reassembles the pieces to reconstruct
the input manifold using the Bayes inverse PDF
$\Pr(x_0'|x_1) = \frac{\Pr(x_1|x_0')\,\Pr(x_0')}{\int dx_0\, \Pr(x_1|x_0)\,\Pr(x_0)}$),
so it is ideally suited to the task at hand. An objective
function $D$ needs to be defined to measure how accurately
the reconstruction $x_0'$ approximates the original input $x_0$.


It is simplest to use a Euclidean objective function
that measures the average squared (i.e. $L^2$) distance
$\|x_0 - x_0'\|^2$, and which must be minimised with respect
to the encoder $\Pr(x_1|x_0)$ (note that the decoder
$\Pr(x_0|x_1)$ is then completely determined by Bayes'
theorem).


this diagram, whereas the reconstruction vector $x_0'(x_1)$
is less important so it is shown as a dashed line.

Thus far a non-parametric representation of $\Pr(x_1|x_0)$
and $x_0'(x_1)$ has been used, so analytic minimisation of
$D$ [4] leads to $\Pr(x_1|x_0) \to \delta(x_1 - x_1(x_0))$, in which
case the encoder could be perfect (i.e. lossless) and
it would not be possible to discover the structure of
the input manifold, as discussed in Section II. To make
progress constrained forms of $\Pr(x_1|x_0)$ and $x_0'(x_1)$ must
be used in order to limit the resources available to the
encoder/decoder, and thus force it to discover clever ways of
mapping the input manifold to reduce the damage caused
by having only limited coding resources.


One way of constraining the encoder/decoder is for $x_1$
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Figure 7: An encoder/decoder pair represented as the chain 
$x_0 \to x_1 \to x_0'(x_1)$. This contains only those parts of the
FMC that affect the Euclidean distortion objective function. 


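$$
\begin{aligned}
D \le{} & \frac{2}{n_1} \int dx_0\, \Pr(x_0) \sum_{y_1=1}^{m_1} \Pr(y_1|x_0)\, \|x_0 - x_0'(y_1)\|^2 \\
& + \frac{2(n_1 - 1)}{n_1} \int dx_0\, \Pr(x_0)\, \Big\| x_0 - \sum_{y_1=1}^{m_1} \Pr(y_1|x_0)\, x_0'(y_1) \Big\|^2
\end{aligned}
\qquad (3.3)
$$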


where the unconstrained encoder/decoder corresponds to 
the left hand side of Equation 3.3 and the constrained 
encoder/decoder corresponds to the right hand side of 
Equation 3.3. Note how the random fluctuations in the 
multiple sample histogram are analytically summed over 
in Equation 3.3, leaving only the single sample encoder 
$\Pr(y_1|x_0)$ to be optimised.
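The right hand side of Equation 3.3 can be estimated from a batch of data. The sketch below assumes the bound is the sum of a term weighted by $2/n_1$ (the average squared distance to individual reconstruction vectors) and a term weighted by $2(n_1 - 1)/n_1$ (the squared distance to the posterior-weighted mean reconstruction); the function name `svq_upper_bound` and the random test values are illustrative and do not come from the paper.

```python
import numpy as np

def svq_upper_bound(x0, post, recon, n):
    """Batch estimate of the two terms of the SVQ upper bound.

    x0:    (N, d) input vectors
    post:  (N, m) single-sample encoder probabilities, rows sum to 1
    recon: (m, d) reconstruction vectors, one per code index
    n:     number of samples drawn from the encoder
    """
    # Squared distance from every input to every reconstruction vector, shape (N, m)
    sq_dist = ((x0[:, None, :] - recon[None, :, :]) ** 2).sum(axis=2)
    d1 = (2.0 / n) * np.mean((post * sq_dist).sum(axis=1))
    # Posterior-weighted mean reconstruction for each input, shape (N, d)
    mean_recon = post @ recon
    d2 = (2.0 * (n - 1) / n) * np.mean(((x0 - mean_recon) ** 2).sum(axis=1))
    return d1, d2

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=500)
x0 = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # points on the unit circle
recon = rng.normal(size=(6, 2))                          # arbitrary (untrained) values
post = rng.dirichlet(np.ones(6), size=500)               # arbitrary (untrained) posteriors

d1, d2 = svq_upper_bound(x0, post, recon, n=20)
print(d1 + d2)
```

With `n=1` the second term vanishes, leaving only the first term, which is consistent with a single sample reducing to a standard vector quantiser objective.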

A further constraint is to assume that $\Pr(y_1|x_0)$ is
parameterised as the normalised output of a set of sigmoid
functions

$$
\begin{aligned}
\Pr(y_1|x_0) &= \frac{Q(y_1|x_0)}{\sum_{y_1'=1}^{m_1} Q(y_1'|x_0)} \\
Q(y_1|x_0) &= \frac{1}{1 + \exp\left(-w_{10}(y_1) \cdot x_0 - b_1(y_1)\right)}
\end{aligned}
\qquad (3.4)
$$


where $Q(y_1|x_0)$ is the unnormalised output from code
index $y_1$, depending on the weight vector $w_{10}(y_1)$ and the
bias $b_1(y_1)$. This parameterisation of $\Pr(y_1|x_0)$ ensures
that it can be used to slice pieces off convex manifolds
as illustrated in Figure 4. Optimisation of the objective
function is then achieved by gradient descent variation of
the three sets of parameters $w_{10}(y_1)$, $b_1(y_1)$, and $x_0'(y_1)$.
These (and other) derivatives of the objective function
were given in [5].
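The sigmoid parameterisation of Equation 3.4 is straightforward to evaluate. The following sketch computes the normalised posterior for a batch of inputs; the function name `svq_posterior` and the hand-picked weights and biases are hypothetical and are not the trained values behind Figure 4.

```python
import numpy as np

def svq_posterior(x0, weights, biases):
    """Normalised sigmoid outputs.

    x0:      (N, d) input vectors
    weights: (m, d) one weight vector per code index
    biases:  (m,)   one bias per code index
    Returns an (N, m) array whose rows sum to 1.
    """
    q = 1.0 / (1.0 + np.exp(-(x0 @ weights.T + biases)))   # unnormalised Q
    return q / q.sum(axis=1, keepdims=True)

# Illustrative (untrained) parameters: 6 weight vectors spaced evenly around the
# circle, so each sigmoid's Q = 1/2 line is a chord cutting off one arc.
m = 6
angles = 2.0 * np.pi * np.arange(m) / m
weights = 4.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
biases = np.full(m, -2.0)

theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
x0 = np.stack([np.cos(theta), np.sin(theta)], axis=1)       # unit circle inputs
post = svq_posterior(x0, weights, biases)
print(post.shape, post.sum(axis=1)[:3])
```

Because the outputs are normalised across code indices, the regions that slice the manifold are bounded by curved contours of the posterior rather than by the straight lines on which each individual sigmoid equals 1/2.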

The constrained objective function in Equation 3.3 
and Equation 3.4 yields a great variety of useful results, 




to be a scalar index $y_1$ where $y_1 = 1, 2, \cdots, m_1$ ($m_1$ is the
size of the code book), which is a single sample drawn
from the encoder $\Pr(x_1|x_0)$. Analytic minimisation of
$D$ [4] now leads to $\Pr(x_1|x_0) \to \delta_{y_1, y_1(x_0)}$ so that
$D = 2 \int dx_0\, \Pr(x_0)\, \|x_0 - x_0'(y_1(x_0))\|^2$ which is the objective
function for a standard least squares vector quantiser [3].
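The single-sample case can be illustrated with a batch estimate of the least squares vector quantiser objective. This is a sketch that assumes the deterministic encoder $y_1(x_0)$ is the nearest-neighbour rule, which is the standard choice for fixed reconstruction vectors; the name `vq_distortion` and the example code book are illustrative.

```python
import numpy as np

def vq_distortion(x0, code_vectors):
    """Return (D, y1) for a least squares vector quantiser.

    x0:           (N, d) input vectors
    code_vectors: (m, d) reconstruction vectors
    D is 2 * mean squared distance to the selected code vector, and y1 holds
    the selected code index for each input (nearest-neighbour encoding).
    """
    sq_dist = ((x0[:, None, :] - code_vectors[None, :, :]) ** 2).sum(axis=2)
    y1 = sq_dist.argmin(axis=1)                       # deterministic encoder
    return 2.0 * sq_dist[np.arange(len(x0)), y1].mean(), y1

# Example: unit circle data and 6 code vectors placed evenly on the circle.
theta = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
x0 = np.stack([np.cos(theta), np.sin(theta)], axis=1)
angles = 2.0 * np.pi * np.arange(6) / 6
code_vectors = np.stack([np.cos(angles), np.sin(angles)], axis=1)

distortion, y1 = vq_distortion(x0, code_vectors)
print(distortion, np.bincount(y1))
```

Each code index claims a hard-edged arc of the circle here, in contrast to the soft, overlapping slices produced when several samples are drawn from a stochastic encoder.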


A better way of constraining the encoder/decoder is for
$x_1$ to be the histogram $(\nu_1, \nu_2, \cdots, \nu_{m_1})$ of counts of
independent samples of the scalar index $y_1$
($n_1 = \sum_{y_1=1}^{m_1} \nu_{y_1}$ is the total number of samples), and for
$x_0'(x_1)$ to be approximated as
$x_0'(x_1) \approx \sum_{y_1=1}^{m_1} \frac{\nu_{y_1}}{n_1}\, x_0'(y_1)$ (rather than
using the full functional form $x_0'(\nu_1, \nu_2, \cdots, \nu_{m_1})$).
Although it is still possible to obtain analytic results it
usually requires a lot of calculation [6], and it is generally
better to use a numerical optimisation approach. An
upper bound for the objective function $D$ is then given
by Equation 3.3 [5].


such as the simple result shown in Figure 4 which used
$m_1 = 6$, $n_1 = 20$, and $x_0 = (\cos\theta, \sin\theta)$ with $\theta$ uniformly
distributed in $[0, 2\pi]$, to learn a mapping from the
1-dimensional input manifold embedded in a 2-dimensional
space ($\dim x_0 = 2$) to a 6-dimensional space ($m_1 = 6$).
This is the key objective function that can be used to
optimise the mapping of the input manifold to a new
representation $\Pr(y_1|x_0)$ for $y_1 = 1, 2, \cdots, m_1$. Minimising
the Euclidean distortion ensures that $\Pr(y_1|x_0)$ defines
an optimal mapping of the input manifold, such that
when $n_1$ samples are drawn from $\Pr(y_1|x_0)$ they contain
enough information to form an accurate reconstruction
of $x_0$.


Figure 8 is a transformed version of Figure 7 that shows
an example of the structure of some types of optimal
solution that are obtained by minimising the constrained
objective function in Equation 3.3. The tensor product
structure of the input manifold is revealed in this type
of solution, because the input vector $x_0$ splits into two
parts as $x_0 = (x_0^a, x_0^b)$ each of which is separately
encoded/decoded. This type of factorial encoder is favoured
by limiting the size of the code book $m_1$ and by using an
intermediate number of samples $n_1$ ($n_1 = 1$ leads to a
standard VQ, and $n_1 \to \infty$ allows too many coding
resources to lead to clever encoding schemes).


The self-organised emergence of factorial encoders is 
one of the major strengths of the SVQ approach. It allows 
the code book to split into two or more separate smaller 
code books in a data driven way rather than being hard- 
wired into the code book at the outset (e.g. [10]). 
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Pr(xf | x0) 


Figure 8: A factorial encoder/decoder pair represented as 
the pair of disconnected chains $x_0^a \to x_1^a \to x_0'^a(x_1^a)$ and
$x_0^b \to x_1^b \to x_0'^b(x_1^b)$. The input vector is $x_0 = (x_0^a, x_0^b)$
highlighted by the left hand rectangle, the code is $x_1 = (x_1^a, x_1^b)$
highlighted by the right hand rectangle, and the reconstruction
is $x_0' = (x_0'^a, x_0'^b)$. The dependencies amongst the variables
are indicated by the arrows in the diagram, which shows that
subspaces a and b are independently encoded/decoded.


B. Chain of Stochastic Vector Quantisers 


The encoder/decoder in Figure 7 leads to useful results 
for mapping the input data manifold when its operation 
is constrained in various ways. A much larger variety 
of mappings may be constructed if the encoder/decoder 
is viewed as a basic module, and then networks of linked 
modules are used to process the data [8]. It is simplest to 
regard this type of network as progressively mapping the 
input manifold as it flows through the network modules. 




Figure 9: A 3-stage chain of linked SVQs. The $l$-th encoder is
modelled using the conditional probability $\Pr(x_l|x_{l-1})$,
and the corresponding decoder is modelled using the
reconstruction vectors $x_{l-1}'(x_l)$, where each $x_l$ is a histogram of
samples.


Figure 9 shows a 3-stage chain of linked
encoder/decoders of the type shown in Figure 7. The
important part of this diagram is the processing chain which
is the solid line flowing from left to right at the top
of the diagram creating the Markov chain
$x_0 \xrightarrow{\Pr(x_1|x_0)} x_1 \xrightarrow{\Pr(x_2|x_1)} x_2 \xrightarrow{\Pr(x_3|x_2)} x_3$.
The reconstruction vectors $x_{l-1}'(x_l)$ for $l = 1, 2, 3$ are
the dashed lines flowing from right to left.


The state $x_l$ of layer $l$ of the chain is the histogram
$(\nu_1, \nu_2, \cdots, \nu_{m_l})$ of counts of samples drawn from
$\Pr(y_l|x_{l-1})$ ($n_l = \sum_{y_l=1}^{m_l} \nu_{y_l}$ is the total number of
samples). In numerical implementations $x_l$ is chosen to be
the (normalised) histogram for an infinite number of samples
(i.e. the relative frequencies implied by $\Pr(y_l|x_{l-1})$),
and $x_{l-1}'(x_l)$ is chosen to depend on only a finite number
$n_l$ of samples randomly selected from this histogram,
$x_{l-1}'(x_l) \approx \sum_{y_l=1}^{m_l} \frac{\nu_{y_l}}{n_l}\, x_{l-1}'(y_l)$. This choice of how to
operate the network is not unique but it has the advantage
of simplifying the computations. The infinite number of
samples used in $x_l$ ensures that the $x_l$ do not randomly
fluctuate, so no Monte Carlo simulations are required to
implement the feed-forward flow through the network.
The finite number of samples $n_l$ used in $x_{l-1}'(x_l)$ leads
to exactly the same objective function as in Equation 3.3
where the random fluctuations are analytically summed
over, which ensures that each decoder has limited
resources and thus forces the optimisation of the network
to discover intelligent ways of encoding the data. This
type of network reduces to a standard way of using a
Markov chain when only a single sample is drawn from
each of the $\Pr(y_l|x_{l-1})$.
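The deterministic feed-forward pass described above can be sketched as follows, assuming each layer state is the normalised sigmoid posterior of Equation 3.4 applied to the previous layer state. The layer sizes (8, 16, 8, 4) and the initialisation range are borrowed from Section IV; the function names and the random (untrained) parameters are illustrative.

```python
import numpy as np

def svq_layer(x_prev, weights, biases):
    """One SVQ stage: normalised sigmoid outputs, used directly as the next
    layer state (the infinite-sample normalised histogram)."""
    q = 1.0 / (1.0 + np.exp(-(weights @ x_prev + biases)))
    return q / q.sum()

def feed_forward(x0, params):
    """Propagate x0 through a chain of SVQ stages; returns all layer states."""
    states = [x0]
    for weights, biases in params:
        states.append(svq_layer(states[-1], weights, biases))
    return states

rng = np.random.default_rng(0)
sizes = (8, 16, 8, 4)                      # layer 0 (input) to layer 3
params = [
    (rng.uniform(-0.1, 0.1, size=(m_out, m_in)),   # weight vectors
     rng.uniform(-0.1, 0.1, size=m_out))           # biases
    for m_in, m_out in zip(sizes[:-1], sizes[1:])
]

phases = rng.uniform(0.0, 2.0 * np.pi, size=4)
x0 = np.stack([np.cos(phases), np.sin(phases)], axis=1).ravel()   # 8-dimensional input
states = feed_forward(x0, params)
print([s.shape for s in states])
```

No random sampling appears in this pass, which reflects the statement that no Monte Carlo simulation is needed for the feed-forward flow; the finite sample number enters only through the objective function.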


Each stage of the chain corresponds to an objective 
function of the form shown in Equation 3.3 but applied 
to the $l$-th stage of the chain. The total objective function
is a weighted sum of these individual contributions. This 
encourages all of the mappings in the chain to minimise 
their average Euclidean reconstruction error, which gives 
a progressive mapping of the input manifold along the 
processing chain. However, the relative weighting of the 
later stages of the chain must not be too great otherwise 
they force the earlier mappings in the chain to become 
singular (e.g. all inputs mapped to the same output), be- 
cause the output of a singular mapping can be mapped 
with little or no more contribution to the overall objective 
function further along the chain. A less extreme form of 
this phenomenon can be used to encourage factorial en- 
coders to emerge, because they produce a (normalised) 
histogram output state that has a smaller volume (in the 
Euclidean sense) than a non-factorial encoder, which re- 
duces the size of the contribution to the overall objective 
function from the next stage in the chain. 


If the chain network topology in Figure 9 is combined 
with the factorial encoder/decoder property of SVQs 
shown in Figure 8 then all acyclic network topologies 
are possible. This can be seen intuitively because flow 
through the chain corresponds to flow along the time- 
like direction in an acyclic network (i.e. following the di- 
rected links), and multiple parallel branches occur wher- 
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ever there is an SVQ factorial encoder in the chain. An 
example of the emergence of this type of network topol- 
ogy will be shown in Section IV. 


IV. LEARNING A HIERARCHICAL NETWORK 


The purpose of this section is to demonstrate the
self-organised emergence of a hierarchical network topology
starting from a chain-like topology of the type shown
in Figure 9. For the purpose of this demonstration the
raw data must have an appropriate correlation structure,
which will be achieved by generating the data as a set of
hierarchically correlated phases. Thus each data vector
is a 4-dimensional vector of phases $\phi = (\phi_1, \phi_2, \phi_3, \phi_4)$,
where the $\phi_i$ are the leaf nodes of a binary tree of phases,
where the binary splitting rule used is $\phi \to (\phi - \alpha, \phi + \beta)$
with $\alpha$ and $\beta$ being independently and uniformly sampled
from the interval $[0, \frac{\pi}{2}]$, and the phase of the root node
is uniformly distributed in the interval $[0, 2\pi]$. This will
lead to each of the $\phi_i$ being uniformly distributed phase
variables thus uniformly occupying a circular manifold.
However, because the $\phi_i$ are correlated due to the way
that they are generated by the binary splitting process,
the $\phi$ do not uniformly populate a 4-torus manifold (i.e.
tensor product of 4 circles).
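The binary splitting process can be simulated directly. The sketch below assumes the offsets are drawn from $[0, \pi/2]$ (the upper limit is a reading of a garbled interval and should be treated as an assumption) and wraps all phases into $[0, 2\pi)$; the function name `hierarchical_phases` is illustrative.

```python
import numpy as np

def hierarchical_phases(num_vectors, depth=2, max_offset=np.pi / 2, seed=0):
    """Generate leaf phases of a binary tree of phases.

    Each node with phase p splits into children (p - alpha, p + beta), with
    alpha and beta drawn independently and uniformly from [0, max_offset].
    depth=2 gives 4 leaf phases per data vector.
    """
    rng = np.random.default_rng(seed)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=(num_vectors, 1))   # root node
    for _ in range(depth):
        alpha = rng.uniform(0.0, max_offset, size=phases.shape)
        beta = rng.uniform(0.0, max_offset, size=phases.shape)
        # Stack the two children of every node and flatten so siblings are adjacent
        phases = np.stack([phases - alpha, phases + beta], axis=2)
        phases = phases.reshape(num_vectors, -1)
    return np.mod(phases, 2.0 * np.pi)

phi = hierarchical_phases(1000)
print(phi.shape)

# Siblings (phi_1, phi_2) differ by alpha + beta, which lies in [0, pi],
# whereas cousins (e.g. phi_1, phi_3) can differ by more.
sibling_gap = np.mod(phi[:, 1] - phi[:, 0], 2.0 * np.pi)
print(sibling_gap.max())
```

The smaller spread between siblings than between cousins is the hierarchical correlation that the network is expected to discover.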

Figure 5 showed an example of what the manifold of a
pair of correlated variables looks like, with a large degree
of freedom (e.g. $\phi_1 + \phi_2$) and a small degree of freedom
(e.g. $\phi_1 - \phi_2$), and the hyperplanes encoding the manifold
in such a way as to discard information about the small
degree of freedom (e.g. $\phi_1 - \phi_2$). In this way the 3-stage
chain in Figure 9 can progressively discard information
about small degrees of freedom in $\phi$, starting with a
4-torus manifold (non-uniformly populated) and ending up
with a circular manifold (uniformly populated), as will be
seen below. This is the basic idea behind using this type
of self-organising network for data fusion.

Figure 10 shows the co-occurrence matrices of pairs of
the $\phi_i$ displayed as scatter plots. The bands in these plots
wrap around circularly and correspond to the manifold
shown in Figure 5. Because $\phi_1$ and $\phi_2$ (and also $\phi_3$ and
$\phi_4$) lie close to each other in the hierarchy, $\Pr(\phi_1, \phi_2)$
and $\Pr(\phi_3, \phi_4)$ have a narrower band than $\Pr(\phi_1, \phi_3)$,
$\Pr(\phi_1, \phi_4)$, $\Pr(\phi_2, \phi_3)$ and $\Pr(\phi_2, \phi_4)$.

A 3-stage chain of linked SVQs of the type shown in
Figure 9 is now trained, where each stage contributes
an objective function of the form shown in Equation 3.3
and Equation 3.4. The sizes $M$ of the 4 network
layers are $M = (8, 16, 8, 4)$. The size of layer 0 (the
input layer) is determined by the dimensionality of the
input data, whereas the size of each of the other layers
is chosen to be 4 times the number of phase variables
that each is expected to use in its encoding of the input
data, which encourages the progressive removal of small
degrees of freedom from the data as it flows along the
chain into ever smaller layers. The numbers of samples $n$
used for the 3 SVQ stages are $n = (20, 20, 20)$,
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Figure 10: Co-occurrence matrices of all pairs of phases. Note 
that the block structure is symmetric so each off-diagonal
co-occurrence matrix appears twice. The hierarchical
correlations cause $\phi_1$ and $\phi_2$ (and similarly $\phi_3$ and $\phi_4$) to be more
strongly correlated with each other than $\phi_2$ and $\phi_3$.


which are large enough to allow each SVQ to develop
into a factorial encoder, so that the processing can
proceed in parallel along several paths (which are
progressively fused) along the chain. The relative weightings $\lambda$
assigned to the objective functions contributed by each
of the 3 SVQ stages are $\lambda = (1, 5, 0.1)$, where a large
weighting is assigned to the stage 2 SVQ to encourage
the stage 1 SVQ to develop into a factorial encoder, and
a small weighting is assigned to the stage 3 SVQ because
the stage 2 SVQ needs no additional encouragement to
develop into a factorial encoder.

The network was trained by gradient descent on the
overall network objective function, using a step size
chosen separately for each SVQ stage and separately for each
of the 3 parameter types $w_{l,l-1}(y_l)$, $b_l(y_l)$, and $x_{l-1,l}'(y_l)$
in each SVQ stage (for $l = 1, 2, 3$). The size of all of
these step size parameters was chosen to be large at the
start of the training schedule, and then gradually reduced
as training progressed, with the relative rate of reduction
being chosen to encourage earlier SVQ stages to converge
before later SVQ stages. In general, different choices of
network parameters and training conditions lead to
different types of trained network, and since there is no prior
reason for choosing one particular solution in preference
to another the choice must be left up to the user. All the
components of the weight vectors, biases, and reconstruction
vectors were initialised to random numbers uniformly
distributed in the interval $[-0.1, 0.1]$.

Figure 11 shows the reconstruction vectors $x_{l-1,l}'(y_l)$
(for $l = 1, 2, 3$) in a trained 3-stage chain of linked
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Figure 11: Reconstruction vectors xj~-1,:(y:) (for 1! = 1,2,3) 
after training a 3-stage network of linked SVQs on the hierar- 
chically correlated phases data. These diagrams are rotated 
90° anticlockwise relative to Figure 9, so the processing chain 
runs from bottom to top of each diagram. Line thickness in- 
dicates the size of a reconstruction vector component, and 
dashing indicates that the component is negative. (a) All re- 
construction vector components. (b) Largest reconstruction 
vector components obtained by applying a threshold to the 
magnitude of each component. (c) The same as (b) except 
that the positions of the nodes in all layers (other than the 
input layer) have been permuted (along with the connections 
between layers) to make the network topology clearer. 


SVQs of the type shown in Figure 9. Reconstruction
vectors are displayed because they are easier
to interpret than weight vectors. The data
$\phi = (\phi_1, \phi_2, \phi_3, \phi_4)$ is embedded in an 8-dimensional
input space as $x = (x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8)$ where
$(x_{2i-1}, x_{2i}) = (\cos\phi_i, \sin\phi_i)$. The key diagram is
Figure 11c which shows the largest components of the
reconstruction vectors, and has been reordered to make
the hierarchical network topology clear. Each of the first
two stages of this network has learnt to operate as two or
more encoder/decoders (i.e. a factorial encoder/decoder)
as in Figure 8. The first stage of the network breaks
into 4 encoder/decoders that encode each of the $\phi_i$ (see
the results in Figure 14 and Figure 15 for justification
of this), the second stage of the network breaks into 2
encoder/decoders that encode $\phi_1 + \phi_2$ and $\phi_3 + \phi_4$ (see
the results in Figure 16 and Figure 17 for justification
of this), and the third stage of the network is a single
encoder that encodes $\phi_1 + \phi_2 + \phi_3 + \phi_4$ (see the results
in Figure 18 and Figure 19 for justification of this). The
connectivity in the stage 1 SVQ is not the same for all of
the $\phi_i$ because of the interaction between the thresholding
prescription used to create Figure 11 and the different
orientation of each of the 4 parts of the stage 1 factorial
encoder with respect to each of the 4 corresponding
circular input manifolds.
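The embedding of the four phases into the 8-dimensional input space can be written compactly. This is a minimal sketch in which the function names `embed_phases` and `recover_phases` are illustrative; the inverse map is included only to show that the embedding loses no information about the phases.

```python
import numpy as np

def embed_phases(phases):
    """Map phases of shape (N, k) to inputs of shape (N, 2k), with each phase
    phi_i placed at positions (2i-1, 2i) as (cos phi_i, sin phi_i)."""
    phases = np.atleast_2d(np.asarray(phases, dtype=float))
    x = np.empty((phases.shape[0], 2 * phases.shape[1]))
    x[:, 0::2] = np.cos(phases)    # odd-numbered components x_1, x_3, ...
    x[:, 1::2] = np.sin(phases)    # even-numbered components x_2, x_4, ...
    return x

def recover_phases(x):
    """Invert the embedding, returning phases in [0, 2*pi)."""
    return np.mod(np.arctan2(x[:, 1::2], x[:, 0::2]), 2.0 * np.pi)

phi = np.array([[0.0, np.pi / 2, np.pi, 3 * np.pi / 2]])
x = embed_phases(phi)
print(x.shape)
print(np.round(x, 3))
```

Each pair of input components lies on a unit circle, so all inputs lie in the range [-1, 1].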


Although the network in Figure 11 computes using
continuous-valued numbers, the thresholded reconstruction
vectors in Figure 11b may be inspected to reveal
the symbolic logic expressions that approximate to each
of the (thresholded) outputs $O_i(x)$ for $i = 1, 2, 3, 4$ from
the highest layer of the network (using logical negation $\bar{x}_i$
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to denote —x; because the inputs lie in the range [—1, 1]). 


O;(x) = To x3 ial x5 1 xg 
Oo(x) = %& 1%3 NM 25 1% 
= 01(2) 
O3(x) = MONG MN U7 
Ox(a) = 21 N2@4N wg NE 
= O3(x) (4.1) 
In order to keep these expressions short they use a slightly
higher threshold than was used to create Figure 11b,
because this suppresses some of the reconstruction vector
components linked to $(x_1, x_2, x_3, x_4)$. In this particularly
simple example it is possible to obtain very short
symbolic expressions, but more generally continuous-valued
computations would be needed to obtain good
approximations to the network outputs.


Figure 12: Some typical examples of node activities 
$\Pr(y_l|x_{l-1})$ (for $l = 1, 2, 3$) in the trained 3-stage network
of linked SVQs. The permuted version of the network is used 
to make the results easier to interpret. The area of each filled 
circle is proportional to the activity it represents, and the neg- 
ative values that occur in the input layer are represented by 
unfilled circles. The hierarchical structure of the network is 
indicated by drawing a box around each part of each network 
layer that acts as a separate encoder, so that typically there 
are one or two active nodes within every box. 


Figure 12 shows some examples of the node activities
$\Pr(y_l|x_{l-1})$ (for $l = 1, 2, 3$) in the trained network shown
in Figure 11c. The individual pieces of each factorial
encoder are indicated by the boxes, and the patterns of
activity are such that every box contains one or more
active nodes, as would be expected if each box were acting
as a separate encoder.
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Figure 13: Two ways of encoding a pair of correlated phases ¢1 
and $\phi_2$. The co-occurrence matrix of $\phi_1$ and $\phi_2$ is represented
as a narrow band that is populated by data points, so that the
correlation manifests itself as $\phi_1 \approx \phi_2$. (a) This shows how
an invariant encoder operates, in which the response region
of each node is oriented so that it has high resolution for
$\phi_1 + \phi_2$ but is completely insensitive to $\phi_1 - \phi_2$. This does
not encode information about the small degree of freedom
measured across the band of the co-occurrence matrix. (b)
This shows how a factorial encoder operates, in which the
response region of each node is highly anisotropic, with high
resolution for one of the phases but completely insensitive to
the other phase. Accurate encoding is achieved by using the
nodes in pairs with orthogonally intersecting response regions,
as shown in the example highlighted in the diagram.


Figure 13 shows simplified versions of the two types 
of encoder/decoder that occur in Figure 11 overlaid on a 
co-occurrence matrix of the type shown in Figure 10. 




Figure 15: Node activities in layer 1 as a function of the inputs
$\phi_3$ and $\phi_4$ (with $\phi_1 = \phi_2 = 0$) which has the properties of a
factorial encoder. The non-zero responses correspond to the
8 nodes in layer 1 that are strongly connected to $\phi_3$ and $\phi_4$.
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Figure 14: Node activities in layer 1 as a function of the inputs 
$\phi_1$ and $\phi_2$ (with $\phi_3 = \phi_4 = 0$) which has the properties of
a factorial encoder. In each plot the contours representing
the node response are overlaid on the co-occurrence matrix of
the pair of inputs $\phi_1$ and $\phi_2$. One of the contour heights is
drawn bold to highlight the region where the node response
is large. Half of the nodes do not respond at all, and the
other half split into two subsets of equal size, one with high
resolution in $\phi_1$ but completely insensitive to $\phi_2$, and the
other with high resolution in $\phi_2$ but completely insensitive to
$\phi_1$. The response is sensitive to the small degree of freedom
measured across the band of the co-occurrence matrix. The
non-zero responses correspond to the 8 nodes in layer 1 that
are strongly connected to $\phi_1$ and $\phi_2$.


Figure 14 and Figure 15 show the encoders that occur 
in stage 1 of Figure 11. These are all factorial encoders 
of the type shown in Figure 13b, as can be seen from the 
orientation of the response regions for the various nodes 
which cuts across the band of the co-occurrence matrix 
in the same way as in Figure 13b. This corresponds to 
the connectivity seen in Figure 11c where each $\phi_i$ has its
own encoder. 

Figure 16 and Figure 17 show the encoders that occur 
in stage 2 of Figure 11. These are invariant encoders of 
the type shown in Figure 13a, as can be seen from the 
orientation of the response regions for the various nodes 
which cuts across the band of the co-occurrence matrix 
in the same way as in Figure 13a. This corresponds to 
the connectivity seen in Figure 11c where each of $\phi_1 + \phi_2$
and $\phi_3 + \phi_4$ has its own encoder.

Figure 18 and Figure 19 show the encoder that occurs 
in stage 3 of Figure 11. This is an invariant encoder of 
the type shown in Figure 13a, which corresponds to the 
connectivity seen in Figure 11c. 


The diagrams in this section show how a 3-stage chain 
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Figure 16: Node activities in layer 2 as a function of the inputs 
$\phi_1$ and $\phi_2$ (with $\phi_3 = \phi_4 = 0$) which has the properties of
an invariant encoder. Half of the nodes do not respond at
all, and the other half respond to well-defined regions in $\phi_1$
and $\phi_2$. The contours representing the node response are
overlaid on the co-occurrence matrix of the pair of inputs $\phi_1$
and $\phi_2$. One of the contour heights is drawn bold to highlight
the region where the node response is large. This shows that
each node responds to a local region of the populated region of
the co-occurrence matrix, and to a limited extent generalises
outside this region. The response is invariant with respect to
the small degree of freedom measured across the band of the
co-occurrence matrix, which demonstrates that layer 2 has
acquired an invariance that was absent in layer 1. There are
also non-zero responses in the unpopulated region of the
co-occurrence matrix which arise because the sum of the node
activities is normalised. The non-zero responses correspond
to the 4 nodes in layer 2 that are strongly connected to $\phi_1$
and $\phi_2$.
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Figure 17: Node activities in layer 2 as a function of the inputs 
$\phi_3$ and $\phi_4$ (with $\phi_1 = \phi_2 = 0$) which has the properties of an
invariant encoder. The non-zero responses correspond to the
4 nodes in layer 2 that are strongly connected to $\phi_3$ and $\phi_4$.


of linked SVQs of the type shown in Figure 9 self- 
organises to process hierarchically correlated phase data. 
Stage 1 makes an approximate copy of the data where 
each phase is separately encoded, then stage 2 encodes 
the output of stage 1 discarding the smallest degrees of 
freedom, and finally stage 3 encodes the output of stage 
2 discarding the next smallest degree of freedom. The 
chain of SVQs has thus split itself into a hierarchical net- 
work of linked encoders that is optimally matched to the 
task of mapping from the original data at the input to 
the chain to the compressed representation at the output 
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Figure 18: Node activities in layer 3 as a function of the inputs 
$\phi_1$ and $\phi_2$ (with $\phi_3 = \phi_4 = 0$) which has the properties of an
invariant encoder.




Figure 19: Node activities in layer 3 as a function of the inputs 
$\phi_3$ and $\phi_4$ (with $\phi_1 = \phi_2 = 0$) which has the properties of an
invariant encoder.


of the chain. This is true self-organisation of multiple 
encoders unlike the hard-wiring of encoders that is used 
in other approaches (e.g. [10]). 




V. CONCLUSIONS 


This paper has shown how it is possible to map a data 
manifold into a simpler form by progressively discard- 
ing small degrees of freedom. This is the key to self- 
organising data fusion, where the raw data is embedded 
in a very high-dimensional space (e.g. the pixel values 
of one or more images), and the requirement is to iso- 
late the important degrees of freedom which lie on a 
low-dimensional manifold. A useful advantage of the ap- 
proach used in this paper is that it assumes only that 
the mapping from manifold to manifold is organised in a 
chain-like topology, and that all the other details of the 
processing in each stage of the chain are to be learnt by 
self-organisation. The types of application for which this 
approach is well-suited are ones in which separation of 
small and large degrees of freedom is desirable. For in- 
stance, separation of targets (small) and jammers (large) 
is relatively straightforward using this approach [7]. 

Data that is not embedded in a higher-dimensional 
space is usually not suitable for processing with the ap- 
proach used in this paper. For instance, categorical (or 
symbolic) data that has one of only a few possible states 
is not suitable, but a smoothly variable array of pixel val- 
ues is suitable. This type of network is intended to op- 
erate on raw sensor data rather than pre-processed data, 
and typically will use high-dimensional intermediate rep- 
resentations in its processing chain. Because of its ability 
to compress the raw data into a much simpler form, this 
type of network would typically be used as a bridge be- 
tween the sub-symbolic raw sensor data and the symbolic 
higher level representation of that data. 

Although the full connectivity between adjacent layers
of the chain implies that the computations can be expensive,
after some initial training the factorial structure of
the various encoders becomes clear and can be used to
prune the connections to keep only the ones that are actually
used (i.e. usually only a small proportion of the total
number). When the chain is fully trained each code index
typically depends on only a small number of contributing
inputs (i.e. a receptive field) from the previous stage
of the chain. Furthermore, because of the normalisation
used in each layer, the size and shape of the receptive
fields mutually interact (i.e. there is a fixed total amount
of activity in each layer), so the raw receptive fields (i.e.
as defined by the feed-forward network weights) are different
from the renormalised receptive fields (i.e. after
taking account of normalisation).

The network described in this paper passes information
along the processing chain in a deterministic fashion,
because it uses (hypothetical) histograms containing an
infinite number of samples, which thus do not randomly
fluctuate. This was done for computational convenience
(i.e. to avoid Monte Carlo simulations) and is not a fundamental
limitation of the approach used. With additional
computational effort it is possible to operate the
network as a (non-deterministic) Markov chain in which
the histograms contain only a finite number of samples,
which therefore randomly fluctuate and explore network
states in the vicinity of the deterministic state used in
this paper.


VI. ACKNOWLEDGEMENTS


The research presented in this paper was supported
by the United Kingdom's MoD Corporate Research Programme.


[1] M Dittenbach, D Merkl, and A Rauber, The growing
hierarchical self-organising map, Proceedings of international
joint conference on neural networks (Como) (S-I
Amari, C L Giles, M Gori, and V Piuri, eds.), IEEE
Computer Society, 2000, pp. 15-19.

[2] T Kohonen, Self-organising maps, Springer-Verlag,
Berlin, 2001.

[3] Y Linde, A Buzo, and R M Gray, An algorithm for vector
quantiser design, IEEE Transactions on Communications
28 (1980), no. 1, 84-95.

[4] S P Luttrell, A Bayesian analysis of self-organising maps,
Neural Computation 6 (1994), no. 5, 767-794.

[5] S P Luttrell, Mathematics of neural networks: models,
algorithms and applications, ch. A theory of self-organising
neural networks, pp. 240-244, Kluwer, Boston, 1997.

[6] S P Luttrell, Combining artificial neural nets: Ensemble and
modular multi-net systems, Perspectives in Neural Computing,
ch. Self-organised modular neural networks for
encoding data, pp. 235-263, Springer-Verlag, London,
1999.

[7] S P Luttrell, Mathematics in signal processing, vol. 5,
ch. Using stochastic vector quantisers to characterise signal and
noise subspaces, pp. 193-204, Oxford University Press,
2002.

[8] S P Luttrell, A Markov chain approach to multiple classifier
fusion, Proceedings of international workshop on multiple
classifier fusion (London) (T Windeatt and F Roli,
eds.), Springer-Verlag, 2003, pp. 217-226.

[9] T M Martinez and K J Schulten, Artificial neural networks,
ch. A "neural-gas" network learns topologies,
pp. 397-402, North-Holland, Amsterdam, 1991.

[10] D A Ross and R S Zemel, Advances in neural information
processing systems, vol. 15, ch. Multiple-cause vector
quantisation, pp. 1041-1046, MIT Press, Cambridge,
2002.




Discrete Network Dynamics. Part 1: Operator Theory * 
An operator algebra implementation of Markov chain Monte Carlo algorithms for simulating
Markov random fields is proposed. It allows the dynamics of networks whose nodes have discrete
state spaces to be specified by the action of an update operator that is composed of creation and
annihilation operators. This formulation of discrete network dynamics has properties that are similar
to those of a quantum field theory of bosons, which allows reuse of many conceptual and theoretical
structures from QFT. The equilibrium behaviour of one of these generalised MRFs and of the
adaptive cluster expansion network (ACEnet) are shown to be equivalent, which provides a way of
unifying these two theories.


I. INTRODUCTION 


The aim of this paper is to present a theoretical frame- 
work for building recurrent network models where the 
states of the network nodes are discrete-valued, which 
will define a general framework for discrete information 
processing that can be implemented in various computa- 
tional architectures. The introduction of recurrence into 
networks makes them much more difficult to analyse and 
control than feed-forward networks. The basic reason for 
these difficulties is that loopy propagation in recurrent 
networks causes each network observable to be a sum of 
an infinite (or, at least, a very large) number of contri- 
butions. 


One type of network that can be modelled using this 
framework is a network of spiking neurons, where the 
presence or absence of a spike is a binary quantity (i.e. 
it is discrete-valued). However, in this paper, there is no 
specific aim to model biological information processing, 
but there will nevertheless be points of contact between 
the general information processing framework presented 
here and the specific details of biological information pro- 
cessing. 

The only consistent way of processing information is to 
use Bayesian methods [2], which represent information by 
using the joint probability of the states of the network 
nodes, and process information (or make inferences) by 
manipulating these joint probabilities according to well- 
defined rules such as Bayes theorem. The Bayesian ap- 
proach achieves its consistency by not discarding any of 
the various alternative inferences that can be made, and 
by following up the consequences of all of the alternatives 
it ensures that there are never any of the contradictions 
that would otherwise occur, such as reaching conclusions 
that depend on which route one takes through the maze 
of inferences. 
Bayesian information processing needs a flexible way
of representing and manipulating joint probabilities. An 
ideal framework for this is Markov random field (MRF) 
theory [8], because it allows one to systematically build 
up a joint probability model out of pieces that have a sim- 
ple functional dependence on the underlying state vari- 
ables. For networks that have a finite number of nodes, 
each of which has a finite number of states, the MRF ap- 
proach allows all possible joint probability models to be 
constructed, so use of the MRF framework imposes no 
artificial constraints. Because the MRF approach con- 
structs a joint probability model, it can be cleanly cou- 
pled to any other probability modelling approach. 


The implementation of MRFs is usually done using 
stochastic Markov chain Monte Carlo (MCMC) compu- 
tations, unless the MRF happens to have a particularly 
simple topology which allows a simpler deterministic im- 
plementation to be achieved (e.g. a tree-like topology 
allows exact computations to be done). In this paper no 
simplifying assumptions will be made about the network 
topology, in order to create the most general possible 
theoretical framework for discrete information process- 
ing. The simplest type of MCMC computation stochasti- 
cally updates the joint state of the MRF, so that it moves 
around its joint state space visiting every joint state with 
a frequency that is proportional to the joint probability 
specified by the MRF. More sophisticated MCMC com- 
putations do the same thing but with an ensemble of joint 
states of the MRF; these are known as “particle filtering” 
algorithms [3]. 


The main result that is presented in this paper is a 
new way of describing MCMC algorithms, in which the 
updating of the MRF joint state (i.e. the joint state 
of the network nodes) is decomposed into a set of more 
elementary operations, which are the creation and anni- 
hilation of network node states. In the simplest case, a 
single MCMC update changes the joint state of an MRF 
by modifying its state at a single node of the network, 
which can be decomposed into first annihilating the old 
node state then creating the new node state. Any MCMC 
algorithm can be composed out of a sequence of such 
creation and annihilation operations. Furthermore, the 
properties of the operators that enact these creation and 
annihilation operations are very familiar to physicists, 


because they are identical to the properties of the cre- 
ation and annihilation operators that appear in a quan- 
tum field theory (QFT) of bosons [9]. This allows a lot 
of preexisting conceptual and computational machinery
to be brought to bear upon the problem of describing 
MCMC algorithms. By drawing an analogy with multi- 
particle QFT states, the MRF framework can be consis- 
tently generalised so that each node of the network exists 
in a multiply occupied state, rather than a singly occu- 
pied state. There are also many other points of contact 
with QFT. 

The generalisation of the MRF framework to multi- 
ply occupied node states allows contact to be made with 
a particular type of self-organising network (SON) the- 
ory known as the adaptive cluster expansion network 
(ACEnet) [6]. One of the aims of a SON is to discover 
for itself what network architecture to use to solve an 
information processing task, so it must be able to dy- 
namically change its architecture. This requires splitting 
and merging of network nodes, and also the creation of 
appropriate links between them. In an MRF, if a node is 
split into two nodes there is no consistent way of assigning 
a pairwise state to the resulting pair of nodes, unless the 
preexisting single node had two (or more) states assigned 
to it in the first place. This is exactly what multiple 
occupancy in the generalised MRF framework provides, 
using creation and annihilation operators to manipulate 
these states. Thus the creation and annihilation operator
approach allows MRF theory and SON theory to be
cleanly unified.

The structure of this paper is as follows. In Section 
II the theory of MRFs is summarised, together with the 
details of MCMC algorithms for simulating MRFs. In 
Section III the main new contribution of this paper is 
presented, which is an operator implementation of the 
MCMC algorithm that generalises MRF theory to multi- 
ple occupancy states. Finally, in Section IV some simple 
applications are used to illustrate the use of this oper- 
ator implementation, one of which is the demonstration 
that the equilibrium state of a particular type of multiply 
occupied MRF has the same properties as ACEnet. 


II. MARKOV RANDOM FIELDS 


The aim of this section is to review the MRF frame- 
work for building and manipulating the joint probabil- 
ity models that are used when doing Bayesian informa- 
tion processing. This includes some informal material 
in which multiple occupancy of node states is discussed 
before giving the more formal development later on in 
Section III. 

Section IIA introduces MRFs and the Hammersley-Clifford
expansion of joint probabilities, and Section IIB
describes an MCMC algorithm for sampling the joint
states of an MRF. Section IIC describes how MRFs can be
used to do Bayesian inference. Finally, Section IID
introduces the concept of a multiple occupancy state, which
is essential for the generalisation of MRFs that is
presented later in Section III.


A. Basic Markov Random Field Theory 


MRFs are a flexible way of constructing joint probabili- 
ties based on the Hammersley-Clifford expansion (HCE), 
which is defined as [1] 


\[
\Pr(x) = \frac{1}{Z} \prod_{k} \prod_{c} p_c^k(x_c) \tag{2.1}
\]


where x is the joint state (x_1, x_2, ..., x_N) of an MRF
with N nodes, k is the order of the term in the expansion
(i.e. k is the number of components of x that the
term depends on, which is thus a k-tuple), c labels the
particular k-tuple (or k-clique) that the term depends on,
x_c is the k-tuple (or clique state), p_c^k(x_c) is the probability
factor (or clique factor) associated with x_c, and Z is a
normalisation factor to ensure that the total probability
sums to unity as Σ_x Pr(x) = 1, so Z is defined as


\[
Z = \sum_{x} \prod_{k} \prod_{c} p_c^k(x_c) \tag{2.2}
\]


There are some minor technical issues to do with exactly
how the states of the x_c are enumerated in the HCE to
ensure that states are not double-counted, but these are
not important here.

To compute the average ⟨S⟩ of a statistic S(x) you
need to evaluate the following


\[
\langle S \rangle = \sum_{x} \Pr(x)\, S(x)
= \frac{\sum_{x} S(x) \prod_{k} \prod_{c} p_c^k(x_c)}{\sum_{x} \prod_{k} \prod_{c} p_c^k(x_c)} \tag{2.3}
\]


where the probability factor Pr(x) appropriately weights
the contribution of each x in the sum, so that overall the
correct weighted average ⟨S⟩ is computed. Despite the
functional simplicity of the HCE expression for Pr(x), it
is usually not possible to evaluate Equation 2.3 in closed
form, so numerical techniques must be used.
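
As a minimal sketch of why Equation 2.3 is hard to evaluate directly, the following Python fragment computes Z and ⟨S⟩ by exhaustive enumeration for a small chain-shaped MRF that has only 2-clique factors. The chain topology, the factor table `pair_factor`, and the function names are illustrative assumptions and are not taken from this paper.

```python
import itertools


def hce_weight(x, pair_factor):
    """Unnormalised HCE weight: product of 2-clique factors along a chain."""
    w = 1.0
    for i in range(len(x) - 1):
        w *= pair_factor[x[i]][x[i + 1]]
    return w


def brute_force_average(statistic, n_nodes, n_states, pair_factor):
    """Return (<S>, Z) by summing over every joint state x of the chain."""
    z = 0.0
    total = 0.0
    for x in itertools.product(range(n_states), repeat=n_nodes):
        w = hce_weight(x, pair_factor)
        z += w                      # accumulates the normalisation factor Z
        total += w * statistic(x)   # accumulates the numerator of <S>
    return total / z, z


def count_agreements(x):
    """Example statistic: number of neighbouring node pairs in the same state."""
    return sum(x[i] == x[i + 1] for i in range(len(x) - 1))


# Hypothetical 2-clique factor that favours equal neighbouring states.
pair_factor = [[2.0, 1.0], [1.0, 2.0]]
avg_agree, z = brute_force_average(count_agreements, n_nodes=4, n_states=2,
                                   pair_factor=pair_factor)
```

The loop visits all m^N joint states, so this direct approach is feasible only for very small networks, which is the practical motivation for the sampling methods discussed next.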

An intuitive feel for how Equation 2.3 can be evaluated
can be obtained by noting that the relative probability
of a pair of joint states x_1 and x_2 is given by


\[
\frac{\Pr(x_1)}{\Pr(x_2)} = \frac{\prod_{k} \prod_{c} p_c^k((x_1)_c)}{\prod_{k} \prod_{c} p_c^k((x_2)_c)} \tag{2.4}
\]


where the normalising Z factor in Equation 2.1 cancels,
and also any factors in common between the numerator
and denominator of the ratio in Equation 2.4 will cancel.
Thus, if the joint states x_1 and x_2 differ in only a
few of their vector components, then any of the probability
factors p_c^k(x_c) that do not depend on these differing
components will cancel out, leaving a relatively simple
expression for the ratio Pr(x_1)/Pr(x_2). This cancellation
is a key property of the functional form of the HCE in Equation
2.1. Once a simple expression for the relative probability
Pr(x_1)/Pr(x_2) of a pair of joint states x_1 and x_2 is available, it
can be used to define an MCMC algorithm (see Section
IIB) for hopping around between the various joint states
x, which is designed to visit each joint state with a
frequency that is proportional to Pr(x), as is required for
computing a numerical estimate of ⟨S⟩ in Equation 2.3.

B. Markov Chain Monte Carlo Algorithm 


It is possible to construct an MCMC algorithm for hop- 
ping between joint states of an MRF that respects their 
relative probability of occurrence. It is not trivially obvi- 
ous how to design a hopping algorithm with these prop- 
erties, because one has to consider the net effect of all 
of the ways that one’s proposed algorithm can hop in to 
and out of each state, and to check that this does indeed 
give rise to the correct joint probability Pr(x).

Consider a network whose joint state splits into two
parts (x, y) with joint probability Pr(x, y). This joint
probability can be factorised as


\[
\Pr(x, y) = \Pr(x|y) \Pr(y) \tag{2.5}
\]


where Pr(x|y) and Pr(y) are obtained from Pr(x, y)
as Pr(x|y) = Pr(x, y)/Pr(y) and Pr(y) = Σ_x Pr(x, y).
Now update the joint state using


\[
(x, y) \xrightarrow{\Pr(x'|y)} (x', y)
\]


where x' is a sample that is drawn from Pr(x'|y), where
Pr(x'|y) is a conditional probability that has the same
dependence on its arguments as Pr(x|y) above. The joint
probability Pr(x', y) of the updated joint state is then


\[
\Pr(x', y) = \Pr(x'|y) \Pr(y) \tag{2.6}
\]


Comparing Equation 2.5 with Equation 2.6 shows that
the new joint probability Pr(x', y) is the same function
of its arguments as the old joint probability Pr(x, y), by
construction. This would not be the case if the sample x'
was drawn from a Pr(x'|y) that did not have the same
dependence on its arguments as Pr(x|y) above.

The above argument shows that if you have a network
whose joint probability is Pr(x, y), and assuming that
the network starts in an initial joint state (x, y) that has
joint probability Pr(x, y), then updating the joint state
using (x, y) → (x', y), with x' drawn from Pr(x'|y),
guarantees that the new joint state (x', y) has joint
probability Pr(x', y) (which has the same dependence on
its arguments as Pr(x, y)). Thus the joint probability of
the joint state of the network nodes maps to itself under
the update prescription (x, y) → (x', y).


Typically, a sequence of updates is applied, where the 
joint state of the network is split into two parts in dif- 
ferent ways for successive updates, so that eventually all 
the nodes in the network are visited for updating. The
overall effect is that updating causes the network to move 
around in the joint state space of its nodes, whilst guar- 
anteeing that the joint probability of the network node 
states stays the same. 

On the other hand, if the initial joint state (x, y) does
not have joint probability Pr(x, y), then Pr(x', y) and
Pr(x, y) will not be the same functions of their arguments,
so the joint probability will change as the updating
scheme is applied. If a sequence of updates (using a
variety of splittings of the network of nodes, as described
above) is applied then this evolution can converge to a
fixed point where the joint probability is stationary under
updating. Convergence to a unique fixed point
is not actually guaranteed, because an inappropriate update
prescription could be used that leads, for instance, to
non-ergodic behaviour where the whole joint state space is
not explored. However, in practical problems with
soft joint probabilities convergence usually occurs.

In an MRF the ratio of conditional probabilities
Pr(x'_1|y)/Pr(x'_2|y) that is used to generate the MCMC updates
(x, y) → (x', y) is given in Equation 2.4. If the
joint states x'_1 and x'_2 differ in only a few of their vector
components, then there is a lot of cancellation in
Pr(x'_1|y)/Pr(x'_2|y), so the fully simplified expression for
Pr(x'_1|y)/Pr(x'_2|y) is relatively simple. This is what makes
MCMC algorithms so appropriate for MRF networks.
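
The update prescription of this section can be sketched in Python for the simplest choice of splitting, in which x is the state of a single node and y is the state of all the other nodes. The sketch assumes a chain-shaped MRF with a hypothetical 2-clique factor table `pair_factor`; the function names, the sweep schedule, and the fixed random seed are illustrative assumptions rather than details taken from this paper.

```python
import random


def resample_node(x, i, pair_factor, rng):
    """Draw a new state for node i from Pr(x_i | all other nodes).

    Only the cliques that contain node i are needed, because all other
    factors cancel in the conditional probability.
    """
    n_states = len(pair_factor)
    weights = []
    for s in range(n_states):
        w = 1.0
        if i > 0:
            w *= pair_factor[x[i - 1]][s]
        if i < len(x) - 1:
            w *= pair_factor[s][x[i + 1]]
        weights.append(w)
    x[i] = rng.choices(range(n_states), weights=weights)[0]


def mcmc_average(statistic, n_nodes, pair_factor, n_sweeps, seed=0):
    """Estimate <S> by averaging the statistic over successive joint states."""
    rng = random.Random(seed)
    n_states = len(pair_factor)
    x = [rng.randrange(n_states) for _ in range(n_nodes)]
    total = 0.0
    for _ in range(n_sweeps):
        for i in range(n_nodes):  # visit every node once per sweep
            resample_node(x, i, pair_factor, rng)
        total += statistic(x)
    return total / n_sweeps


def count_agreements(x):
    """Example statistic: number of neighbouring node pairs in the same state."""
    return sum(x[i] == x[i + 1] for i in range(len(x) - 1))


estimate = mcmc_average(count_agreements, n_nodes=4,
                        pair_factor=[[2.0, 1.0], [1.0, 2.0]], n_sweeps=20000)
```

For this small example the exact average is 2, because each of the 3 neighbouring pairs agrees independently with probability 2/3, so the estimate can be checked against a known value. No burn-in period is discarded here, which is a simplification.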


C. Inference Using an MRF 


Image processing is an area where MRFs have proved
to be particularly useful [4]. The starting point is to
define an MRF model of the joint probability Pr(x) of
the image pixels


\[
\Pr(x) = \sum_{y} \Pr(x, y), \qquad \Pr(x, y) = \Pr(x|y) \Pr(y) \tag{2.7}
\]


where Pr(x) is expressed as the marginal probability of
Pr(x, y) after the hidden variables y have been summed
over, and both Pr(x|y) and Pr(y) may be written as
products of factors using the HCE in Equation 2.1. The
hidden variables y are the unobserved causes that determine
the values of the image pixels x, and are thus
the causal factors that are used to construct a generative
model of the image. This generative model can be
multi-layered with several levels of hidden variables.

To compute the probability of the joint state of the
hidden variables y given an observation of the image pixel
values x the posterior probability Pr(y|x) must be used,
which may be obtained using Bayes theorem as


\[
\Pr(y|x) = \frac{\Pr(x|y) \Pr(y)}{\sum_{y'} \Pr(x|y') \Pr(y')} \tag{2.8}
\]


An MCMC algorithm (see Section IIB) can then be used 
to draw samples from Pr(y|x). Note that successive sam- 
ples produced by the MCMC algorithm are strongly cor- 
related with each other because the MCMC algorithm 
has a finite memory time; this makes MCMC run times 
(for a given size of error bar) much longer than would be 
the case if the samples could be somehow independently 
drawn from Pr(y|x).

Also, if Pr(y|x) has a single well-defined peak of prob- 
ability, then the MCMC algorithm can be used to locate 
this, usually with the assistance of a simulated annealing 
algorithm to “soften” Pr(y|x) during the early stages of
the algorithm, and then MCMC fluctuations about this 
peak can be observed in order to deduce the robustness 
of the solution. 

Typically, in image processing applications there is a
single overwhelmingly likely hidden variables interpretation
of the image pixels (i.e. Pr(y|x) has a single
well-defined peak of probability). However, the above
approach gracefully (and consistently) degrades when the
interpretation is ambiguous (i.e. Pr(y|x) does not have
a single well-defined peak of probability). This graceful
degradation in the face of ambiguity is one of the
strengths of the Bayesian approach.
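
As an illustrative sketch of Equation 2.8, the following Python fragment computes the posterior Pr(y|x) exactly by enumerating every hidden state for a very small problem. It assumes binary pixels, a chain-shaped prior Pr(y) built from a hypothetical 2-clique factor table, and an independent noise model in which each pixel equals its hidden cause with probability `p_correct`. All of these modelling choices and names are assumptions made for the example only.

```python
import itertools


def prior_weight(y, pair_factor):
    """Unnormalised prior Pr(y) as a product of 2-clique factors along a chain."""
    w = 1.0
    for i in range(len(y) - 1):
        w *= pair_factor[y[i]][y[i + 1]]
    return w


def likelihood(x, y, p_correct):
    """Pr(x|y) for binary pixels, each observed correctly with probability p_correct."""
    like = 1.0
    for xi, yi in zip(x, y):
        like *= p_correct if xi == yi else 1.0 - p_correct
    return like


def posterior(x, pair_factor, p_correct):
    """Return {y: Pr(y|x)} using Bayes theorem with an explicit sum over y."""
    n_states = len(pair_factor)
    joint = {y: likelihood(x, y, p_correct) * prior_weight(y, pair_factor)
             for y in itertools.product(range(n_states), repeat=len(x))}
    evidence = sum(joint.values())  # denominator of Equation 2.8
    return {y: w / evidence for y, w in joint.items()}


post = posterior(x=(1, 1, 0), pair_factor=[[2.0, 1.0], [1.0, 2.0]], p_correct=0.8)
best = max(post, key=post.get)  # most probable hidden interpretation
```

The normalisation of the prior cancels between the numerator and denominator of Equation 2.8, which is why an unnormalised prior weight is sufficient here. For realistic image sizes the sum over y is intractable, and samples would instead be drawn with an MCMC algorithm.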


D. Multiply Occupied States 


It is useful to develop a concrete way of visualising the
hopping processes that underlie the MCMC algorithm
described in Section IIB. This is a prerequisite for the
generalisation of MCMC algorithms that is developed later
in Section III.

The state x of an N-node MRF is x =
(x_1, x_2, ..., x_N), and for a given x each of its components
x_i lives in one of an assumed finite number m of
states that are available to x_i, where for simplicity we assume
that all the x_i have the same number of states m.
One way of representing each x_i is as an m-component
vector (0, 0, ..., 0, 1, 0, ..., 0, 0), where the “1” identifies
which of the m states x_i happens to have. This representation
is essentially a histogram with m bins, with
a single sample occupying one of the bins. The whole
state of the N-component vector x is then represented
by N such histograms, each with a single “1” placed in
the appropriate bin to identify the state of all of the x_i
for i = 1, 2, ..., N. Naturally, this use of histograms is
an exceedingly wasteful coding of the state x because it 
consists mostly of “0” entries. However, it does allow the 
hopping operations that are generated by the MCMC al- 
gorithm to be represented directly as operations in which 
each “1” hops around between the bins of its histogram. 
More importantly, this representation of the MRF state 
is suitable for the generalisation in Section III where each 
histogram will have multiple samples occupying its bins 
(i.e. multiple states will be recorded at each MRF node). 
This is discussed in more detail below. 

Figure 1: Steps of an MCMC update of a Markov chain with
N = 7 and m = 7.


Figure 1 shows a Markov chain with 7 nodes (i.e.
N = 7), each of which has 7 possible states (i.e. m = 7).
The state space of each node is represented by one of the 
rectangles, the particular bin that is occupied by a sam- 
ple is shown as a blob (the unoccupied bins are shown as 
dots), and the particular 2-clique interactions (see Equa- 
tion 2.1) that are activated by the occupied node states 
are shown as bold lines. 


1. The top row of Figure 1 shows a random initial 
state of the Markov chain. 


2. The middle row of Figure 1 shows that the sample 
in node 3 has been annihilated. This is the first step 
of an MCMC update, in which a node is chosen at 
random and its state is erased. 


3. The bottom row of Figure 1 shows that a sample 
in node 3 has been created. This is the second step 
of an MCMC update, in which a sample is created 
in node 3 whose state was previously erased in step 
2 above. The influence of the neighbouring nodes 
is used to probabilistically determine the state in 
which to create the sample, as described in Section 
IIB. 
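
The three steps above can be sketched in Python using the histogram representation, with one list of m bin counts per node. The sketch assumes singly occupied nodes on a chain, and a hypothetical 2-clique factor table that favours neighbouring nodes being in nearby states; the function names are illustrative and are not the operator notation that is developed in Section III.

```python
import math
import random


def one_hot_histograms(x, m):
    """Represent joint state x as N histograms, each with a single occupied bin."""
    return [[1 if s == xi else 0 for s in range(m)] for xi in x]


def annihilate(hist, node, state):
    """Remove one sample from the given bin, which must be occupied."""
    if hist[node][state] == 0:
        raise ValueError("cannot annihilate a sample from an empty bin")
    hist[node][state] -= 1


def create(hist, node, state):
    """Add one sample to the given bin."""
    hist[node][state] += 1


def hop(hist, node, pair_factor, rng):
    """One MCMC update: annihilate the sample at `node`, then create a new one.

    The new bin is chosen with probability proportional to the product of the
    2-clique factors that link it to the occupied bins of the neighbouring nodes.
    """
    m = len(hist[node])
    old = hist[node].index(1)
    annihilate(hist, node, old)
    weights = []
    for s in range(m):
        w = 1.0
        if node > 0:
            w *= pair_factor[hist[node - 1].index(1)][s]
        if node < len(hist) - 1:
            w *= pair_factor[s][hist[node + 1].index(1)]
        weights.append(w)
    new = rng.choices(range(m), weights=weights)[0]
    create(hist, node, new)
    return old, new


rng = random.Random(0)
m = 7
x = [3, 1, 4, 1, 5, 2, 6]  # arbitrary initial state with N = 7 nodes
pair_factor = [[math.exp(-abs(a - b)) for b in range(m)] for a in range(m)]
hist = one_hot_histograms(x, m)
old_state, new_state = hop(hist, node=2, pair_factor=pair_factor, rng=rng)
```

Node indices start from 0 in this sketch, so `node=2` corresponds to node 3 of Figure 1. Because `annihilate` and `create` act on bin counts rather than on a single label, the same two primitives remain usable when a histogram holds more than one sample.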


Figure 2: Multiply occupied Markov chain showing a random 
state. 
The histogram representation allows generalisations of
the MCMC algorithm in which each MRF node is oc- 
cupied by more than one sample, when it is said to be 
multiply occupied. Figure 2 shows an example of this 
type of MRF state. 

It is important not to confuse multiply occupied states 
with other uses of state space: 


1. Histograms with more than one sample are not the 
same as ensembles of histograms each with one 
sample. This is because the former allow for the 
possibility that the MCMC algorithm can cause 
the samples to interact with each other, whereas 
the latter is a means of running multiple standard 
MCMC algorithms in parallel. 


2. Histograms with more than one sample could 
be viewed as having a single “super”-state that
recorded as a single state the entire contents of the 
histogram bins, which would disguise the fact that 
the histogram was actually constructed out of sam- 
ples occupying the histogram bins. The higher level 
super-state description is mathematically equiva- 
lent to the lower-level description in terms of in- 
dividual samples, but it does not allow the devel- 
opment of detailed MCMC algorithms. We prefer 
to view the higher level super-state description as 
an interpretation that is used after the lower level 
details have been worked out using the techniques 
that are presented in this paper. 


In Figure 2 the histogram associated with each node contains
more than one sample. Such multiple occupancy
was not present in the basic MRF theory of Section IIA,
so the detailed form of the MCMC algorithm of Section
IIB must now be generalised. Multiple occupancy is explored
in detail in Section III using creation and annihilation
operator techniques to hop samples between histogram
bins, which is achieved by annihilating a sample
from one bin and creating a sample in another bin, as
illustrated in Figure 1.

When more than one sample per histogram is allowed 
then various new types of processing become possible: 


1. The number of samples per histogram can be var- 
ied with time. This requires birth and death rules 
as well as migration (or hopping) rules for the his- 
togram samples. In this case the creation and anni- 
hilation operators would be applied in ways that do 
not enforce conservation of the number of samples 
in each histogram, so annihilation without subse- 
quent creation (and vice versa) are permitted op- 
erations. This is how “reversible jump” MCMC al- 
gorithms [5] might be implemented using creation 
and annihilation operators. 


2. The samples can interact with each other in com- 
plicated ways to form “bound states”, which would 
then behave like higher level “symbols” (i.e. sets 
of interacting histogram samples) that are con- 
structed out of “sub-symbols” (i.e. the histogram 
samples themselves). This is illustrated in Figure
3, Figure 4 and Figure 5 below. 


Figure 3: Multiply occupied Markov chain showing a tube-like 
joint state. 


Figure 3 shows a multiple-sample version of Figure 1 
that is more highly structured than the example shown 
in Figure 2. For illustrative purposes, the samples are 
now assumed to be in neighbouring states at each node 
rather than spread out at random; typically this would 
be the case for Markov chains whose properties are opti- 
mised to encode information in a topographically ordered 
way. The 2-cliques that then contribute typically form 
the tube-like joint state of activated 2-cliques shown in 
Figure 3. 


Figure 4: Multiply occupied Markov chain showing two par- 
allel tube-like joint states. 




Figure 4 shows another possibility that can arise with 
multiple sample occupancy, where the occupancy of each 
node splits into two separate clusters of samples, and 
where the probability factors associated with the 2-cliques
are such that only node states that are both in
the top half of the diagram are connected (and similarly 
for the bottom half of the diagram), so that there are 
no activated 2-cliques running between the top and bot- 
tom halves of the diagram (or at least the contribution 
of these is negligible). Effectively, this multiply occupied 
Markov chain has two completely independent Markov 
chains embedded within it, each of which has its own 
tube-like joint state of activated 2-cliques. This type of 
structure emerges in multiply occupied Markov chains 
that have a limited number of states available to each 
node of the chain, and which are optimised to encode in- 
formation topographically (which ensures that the tube- 
like joint states are localised in the node state spaces). 


This type of behaviour emerges when SON training 
methods are used, but it will not be discussed further in 
this paper. 


Figure 5: Multiply occupied Markov chain showing two par- 
allel “tube” states bound together. 


Figure 5 shows how Figure 4 can be modified if the two 
tube-like joint states have some node states in common, 
which binds the tubes together. An extreme version of 
this binding between tubes can occur if the situation is as 
shown in Figure 4, but additionally there are some weak 
interactions between the tubes. 


III. OPERATOR IMPLEMENTATION OF 
MCMC ALGORITHMS 


The aim of this section is to present a theoretical 
framework for expressing MCMC algorithms, which is 
based on operators that have very simple algebraic prop- 
erties, but which is nevertheless sufficiently flexible that 
it allows a large class of MCMC-like algorithms to be 
represented. 

Section IIIA gives some background material that motivates
the use of MCMC algorithms as the primary
means of building dynamical models for discrete networks.
Section IIIB introduces creation and annihilation
operators for manipulating samples in multiply occupied
network nodes. Section IIIC uses these basic operators
to construct a composite operator for generating MCMC
updates. Finally, Section IIID summarises a diagrammatic
representation of MCMC algorithms.


A. Background 


The aim here is to rewrite the MCMC algorithm for 
running an MRF (see Section IIB) using operator alge- 
bra. This will allow the algorithm to be run in state 
spaces where the basic MCMC algorithm has not previ- 
ously been used, and will thus generalise the algorithm. 
Throughout this section the emphasis is on using the 
MCMC algorithm as the starting point for deducing the 
properties of an MRF, so the MRF is viewed as corresponding
to the equilibrium behaviour of a (stochastic)
discrete-time dynamical system. Hitherto, the MCMC 
algorithm could be viewed as an artefact of a particu- 
lar way of sampling from an MRF, but here it is viewed 
as the way in which the MRF actually behaves. This 
moves slightly away from the original motivation for us- 
ing MRFs to model and manipulate joint probabilities 
for use in Bayesian calculations (see Section I), but this 
change of emphasis allows full advantage to be taken of the
flexibility of the MCMC approach, and in particular its 
generalisation to multiply occupied states. 

This jump to using discrete-time dynamical systems 
as the starting point for building models allows a much 
larger class of behaviours to be explored, including ones 
that do not have a corresponding HCE representation of 
the equilibrium behaviour (i.e. as a simple product of 
probability factors, as in Equation 2.1), or do not have 
a steady state equilibrium behaviour at all (e.g. a limit 
cycle rather than a limit point, etc). 

The MCMC approach models everything as part of 
a dynamical evolution process, where a static statistical 
model of the world is obtained by taking a snapshot of 
the evolution of the dynamical system. Those who in- 
sist on starting from a fixed graphical model based on 
the HCE (or a set of such models) might be disappointed 
that this is not the starting point that is used here. How- 
ever, they should note that the underlying process that 
generates their graphical model in the first place is ac- 
tually dynamical, and that their model merely describes 
the statistical properties through a time slice of this dy- 
namical process; in other words, their model describes 
only a marginal distribution. For instance, an MRF im- 
age model does not attempt to model the history of the 
dynamical processes that cause the (hidden) objects to 
eventually give rise to the observed pixel values. Analo- 
gously, all MRFs derive from a hidden dynamical process. 

The results presented in this section make use of cre- 
ation and annihilation operator techniques to generate 
the hopping processes that underlie MCMC algorithms, 
which allows MCMC algorithms to be written using a 
very compact notation. These operator techniques will 
be familiar to physicists who use quantum field theory 
(QFT) [9], and for the convenience of physicists the no- 
tation used here is the same as is used in QFT. Gener- 
ally, creation and annihilation operators can be used to 
generate birth and death processes (respectively), which 
thus increase and decrease the dimensionality of the state 
space (respectively), so this approach naturally lends it- 
self to describing processes that correspond to “reversible 
jump” MCMC algorithms [5]. 


B. Creation and Annihilation Operators 


In this section the mathematical development of the 
properties of creation and annihilation operators is
deliberately presented in an informal way, by expressing it in
terms of operations on the samples occupying histogram
bins. This is to encourage a concrete and intuitive under- 
standing of how these operators act on samples, rather 
than to merely think of them as objects that have partic- 
ular algebraic properties. To a physicist who is familiar 
with the use of these techniques in QFT, the explanations 
will appear to be very long-winded and the derivations 
very cavalier, and to them we apologise. 


1. Multiply Occupied States 


The multiply occupied states described in Section II D 
can be manipulated by suitably defined creation and an- 
nihilation operators. 

Multiply occupied states can be viewed as histograms
with multiple samples occupying the histogram bins. 
These histograms can be represented thus: 


1. Empty histogram: $|0\rangle$. This represents the bins
(an indeterminate number of them) of a histogram
with no samples in any of the bins. The notation
$|0\rangle$ has been chosen to correspond exactly to the
“vacuum” state as used by physicists; it represents
the background in which we will create and
annihilate histogram samples (or particles).


2. Histogram with one sample in bin $i$: $a_i^\dagger |0\rangle$. The
$|0\rangle$ represents the empty histogram (as defined
above), and the creation operator $a_i^\dagger$ acting from
the left represents the action of creating one
sample in bin $i$ of the empty histogram. The notation
$a_i^\dagger$ has been chosen to correspond exactly to the
operator for creating a particle in state $i$ as used
by physicists, and the notation $a_i^\dagger |0\rangle$ corresponds
exactly to the notation for a single particle in state
$i$. The use of the dagger notation $\dagger$ (i.e. adjoint
operator) is chosen to make our notation compatible
with that used in QFT [9], which will be discussed
in more detail in Section III B 8.


3. Histogram with $n_i$ samples in bin $i$: $(a_i^\dagger)^{n_i} |0\rangle$.
This is a multiply occupied histogram, which is
obtained by operating on the empty histogram $|0\rangle$
multiple times with the creation operator $a_i^\dagger$.


4. Histogram with $n_i$ samples in bin $i$ (for $i =
1, 2, \cdots, m$): $\prod_{i=1}^{m} (a_i^\dagger)^{n_i} |0\rangle$. This is a
straightforward generalisation of the above, where creation
operators are applied multiple times to all of the
histogram bins.


The above representation of histogram states does not 
provide a means for freely manipulating them. In order 
to be able to do this it is necessary to be able to annihilate 
samples as well as create them as above.


2. Creation and Annihilation Operators


The annihilation operations discussed below may be
achieved by using the annihilation operator $a_i$, which is
the adjoint of the creation operator $a_i^\dagger$. See the
discussion on adjoint operators in Section III B 8 for more
details on why the creation operator $a_i^\dagger$ and annihilation
operator $a_i$ are adjoints of each other. Note that in the
description immediately below the behaviour of $a_i^\dagger$ and
$a_i$ corresponds to our intuitive notion of how these
operators should behave, rather than being formally derived from
their algebraic properties, which are presented later in
Section III B 3.

Annihilating a sample from an empty histogram erases
the state space itself. This simply defines what happens
when you try to remove a sample from an already empty
histogram, which is very useful for cleaning up algebraic
expressions involving $a_i$ and $|0\rangle$. In effect, this defines
the “vacuum” $|0\rangle$ as the reference state for determining
the occupancy of each histogram bin. Thus


$$a_i |0\rangle = 0 \tag{3.1}$$


which can be represented for a 4-bin histogram for any $i$
as


$$(0,0,0,0) \xrightarrow{a_i} 0 \tag{3.2}$$


Annihilating a sample from a 1-sample histogram
leaves an empty histogram. This definition is the
common-sense notion of what should happen when you
create a sample in a histogram bin, then annihilate it
again. Thus


$$a_i a_i^\dagger |0\rangle = |0\rangle \tag{3.3}$$


which can be represented for a 4-bin histogram and for
$i = 3$ as


$$(0,0,0,0) \xrightarrow{a_i^\dagger} (0,0,1,0) \xrightarrow{a_i} (0,0,0,0) \tag{3.4}$$


Annihilating the wrong sample (i.e. $j \neq i$) from a
1-sample histogram erases the state space itself. This is
a generalisation of Equation 3.1 in which the histogram
already contains one sample, but it is in a different bin
from the one from which we are trying to remove a
sample.


$$a_j a_i^\dagger |0\rangle = 0 \qquad j \neq i \tag{3.5}$$


which can be represented for a 4-bin histogram and for
$i = 3$ and $j \neq i$ as


$$(0,0,0,0) \xrightarrow{a_i^\dagger} (0,0,1,0) \xrightarrow{a_j} 0 \tag{3.6}$$


Equation 3.4 and Equation 3.6 can now be combined to
give (the illustration shows the $i = 3$ case)


$$(0,0,0,0) \xrightarrow{a_i^\dagger} (0,0,1,0) \xrightarrow{a_j} \begin{cases} (0,0,0,0) & j = i \\ 0 & j \neq i \end{cases} \tag{3.7}$$


If the location of the occupied bin is unknown, yet you want to be certain that you annihilate the sample, then you have to attempt to annihilate a sample from every one of the histogram bins. This combines the properties of both Equation 3.3 and Equation 3.5. Note that $|0\rangle$ (the empty histogram) is different from 0 (no histogram at all, i.e. not even an empty one).


$$\begin{aligned}
\Big(\sum_{j=1}^{m} a_j\Big) a_i^\dagger |0\rangle &= a_1 a_i^\dagger |0\rangle + a_2 a_i^\dagger |0\rangle + \cdots + a_i a_i^\dagger |0\rangle + \cdots + a_m a_i^\dagger |0\rangle \\
&= 0 + 0 + \cdots + |0\rangle + \cdots + 0 \\
&= |0\rangle
\end{aligned} \tag{3.8}$$


which can be represented for a 4-bin histogram and for $i = 3$ as


$$(0,0,0,0) \xrightarrow{a_i^\dagger} (0,0,1,0) \xrightarrow{\sum_j a_j} (0,0,0,0) \tag{3.9}$$


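Annihilating a sample from a 2-sample histogram (samples in different bins, i.e. $i_1 \neq i_2$) leaves two 1-sample histograms. This is a generalisation of Equation 3.8 in which the histogram starts with two samples (known to be in different bins) rather than one sample.


$$\begin{aligned}
\Big(\sum_{j=1}^{m} a_j\Big) a_{i_1}^\dagger a_{i_2}^\dagger |0\rangle &= a_1 a_{i_1}^\dagger a_{i_2}^\dagger |0\rangle + \cdots + a_{i_1} a_{i_1}^\dagger a_{i_2}^\dagger |0\rangle + \cdots + a_{i_2} a_{i_1}^\dagger a_{i_2}^\dagger |0\rangle + \cdots + a_m a_{i_1}^\dagger a_{i_2}^\dagger |0\rangle \\
&= 0 + \cdots + 0 + a_{i_2}^\dagger |0\rangle + 0 + \cdots + 0 + a_{i_1}^\dagger |0\rangle + 0 + \cdots + 0 \\
&= a_{i_1}^\dagger |0\rangle + a_{i_2}^\dagger |0\rangle
\end{aligned} \tag{3.10}$$


which can be represented for a 4-bin histogram and for $(i_1, i_2) = (1, 3)$ as


$$(0,0,0,0) \xrightarrow{a_1^\dagger} (1,0,0,0) \xrightarrow{a_3^\dagger} (1,0,1,0) \xrightarrow{\sum_j a_j} (1,0,0,0) + (0,0,1,0) \tag{3.11}$$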


Annihilating a sample from a 2-sample histogram (samples in the same bin, i.e. $i_1 = i_2 = i$) leaves two copies of the same 1-sample histogram (because either of the two samples can be annihilated to leave one sample). This is a variation of Equation 3.10, and it is the first example of attempting to annihilate a sample from a bin that has more than one sample in it. The number of ways of annihilating a sample from a multiply occupied bin is equal to the number of samples in the bin.


$$\begin{aligned}
\sum_{j=1}^{m} a_j (a_i^\dagger)^2 |0\rangle &= a_1 (a_i^\dagger)^2 |0\rangle + a_2 (a_i^\dagger)^2 |0\rangle + \cdots + a_i (a_i^\dagger)^2 |0\rangle + \cdots + a_m (a_i^\dagger)^2 |0\rangle \\
&= 0 + 0 + \cdots + 0 + 2 a_i^\dagger |0\rangle + 0 + \cdots + 0 \\
&= 2 a_i^\dagger |0\rangle
\end{aligned} \tag{3.12}$$


which can be represented for a 4-bin histogram and for $i = 1$ as


$$(0,0,0,0) \xrightarrow{a_1^\dagger} (1,0,0,0) \xrightarrow{a_1^\dagger} (2,0,0,0) \xrightarrow{\sum_j a_j} (1,0,0,0) + (1,0,0,0) \tag{3.13}$$
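

The intuitive rules above can be mimicked with a few lines of Python. This is only an illustrative sketch, not part of the operator formalism itself: a state is stored as a dictionary that maps an occupancy tuple to an integer weight (the number of copies), the empty dictionary plays the role of 0, and the names `create`, `annihilate`, `annihilate_any` and `VACUUM` are assumptions introduced here. Bins are indexed from 0 in the code, so bin 3 of the text is index 2.

```python
from collections import defaultdict

# A state is {occupancy tuple: weight}; the empty dict stands for 0 (no histogram).
VACUUM = {(0, 0, 0, 0): 1}

def create(state, i):
    """a_i^dagger: add one sample to bin i (there is only one way to do this)."""
    return {occ[:i] + (occ[i] + 1,) + occ[i + 1:]: w for occ, w in state.items()}

def annihilate(state, i):
    """a_i: remove one sample from bin i, once for each of the n_i samples there."""
    return {occ[:i] + (occ[i] - 1,) + occ[i + 1:]: w * occ[i]
            for occ, w in state.items() if occ[i] > 0}

def annihilate_any(state, m=4):
    """sum_j a_j: attempt annihilation in every bin and add up the outcomes."""
    out = defaultdict(int)
    for j in range(m):
        for occ, w in annihilate(state, j).items():
            out[occ] += w
    return dict(out)

one = create(VACUUM, 2)                   # one sample in bin 3 of the text
two_diff = create(create(VACUUM, 0), 2)   # samples in bins 1 and 3
two_same = create(create(VACUUM, 0), 0)   # two samples in bin 1

print(annihilate(VACUUM, 2))      # compare with Equation 3.1
print(annihilate(one, 2))         # compare with Equation 3.3
print(annihilate(one, 0))         # compare with Equation 3.5
print(annihilate_any(two_diff))   # compare with Equation 3.10
print(annihilate_any(two_same))   # compare with Equation 3.12
```

The weight attached to each occupancy tuple is what records "two copies of the same 1-sample histogram" in the last case.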


3. Creation and Annihilation Operator Commutation Relations


Now that some of the required properties of creation and annihilation operators have been established, we are in a position to guess what their general algebraic properties should be, so that we can do arbitrarily complicated operator manipulations on states of arbitrary occupancy.

All of the above behaviour of creation and annihilation operators (apart from $a_i |0\rangle = 0$ in Equation 3.1) can be summarised in the following commutation relations

$$\begin{aligned}
a_i a_j^\dagger - a_j^\dagger a_i &= \delta_{i,j} \\
a_i a_j - a_j a_i &= 0 \\
a_i^\dagger a_j^\dagger - a_j^\dagger a_i^\dagger &= 0
\end{aligned} \tag{3.14}$$

where $\delta_{i,j}$ is a Kronecker delta ($\delta_{i,j} = 1$ if $i = j$, and $\delta_{i,j} = 0$ if $i \neq j$). These commutation relations are usually written in shorthand notation as

$$\begin{aligned}
[a_i, a_j^\dagger] &= \delta_{i,j} \\
[a_i, a_j] &= 0 \\
[a_i^\dagger, a_j^\dagger] &= 0
\end{aligned} \tag{3.15}$$

The $[a_i, a_j] = 0$ and $[a_i^\dagger, a_j^\dagger] = 0$ commutation relations follow from the fact that a sequence consisting solely of annihilation operators (or solely of creation operators) has the same effect whatever the order in which the operators appear in the sequence. However, this order independence property vanishes when the sequence contains interleaved creation and annihilation operators, as will be explained below.

The $[a_i, a_j^\dagger] = \delta_{i,j}$ commutation relation may be illustrated for a 4-bin empty histogram and for $j = 3$ as

$$\begin{aligned}
(0,0,0,0) \xrightarrow{a_j^\dagger} (0,0,1,0) &\xrightarrow{a_i} \begin{cases} (0,0,0,0) & i = j \\ 0 & i \neq j \end{cases} \\
(0,0,0,0) \xrightarrow{a_i} 0 &\xrightarrow{a_j^\dagger} 0
\end{aligned} \tag{3.16}$$

and for the general histogram as

$$\begin{aligned}
(n_1, n_2, \cdots) \xrightarrow{a_j^\dagger} (n_1, \cdots, n_j + 1, \cdots) &\xrightarrow{a_i} \begin{cases} (n_i + 1)\,(n_1, n_2, \cdots) & i = j \\ n_i\,(n_1, \cdots, n_i - 1, \cdots, n_j + 1, \cdots) & i \neq j \end{cases} \\
(n_1, n_2, \cdots) \xrightarrow{a_i} n_i\,(n_1, \cdots, n_i - 1, \cdots) &\xrightarrow{a_j^\dagger} \begin{cases} n_i\,(n_1, n_2, \cdots) & i = j \\ n_i\,(n_1, \cdots, n_i - 1, \cdots, n_j + 1, \cdots) & i \neq j \end{cases}
\end{aligned} \tag{3.17}$$

Taking the difference of the $a_i a_j^\dagger$ result (i.e. the first line in Equation 3.17) and the $a_j^\dagger a_i$ result (i.e. the second line in Equation 3.17) verifies the commutation relation $[a_i, a_j^\dagger] = \delta_{i,j}$. The key result is the $i = j$ case in Equation 3.17, which has a factor $n_i + 1$ in the $a_i a_i^\dagger$ case and a factor $n_i$ in the $a_i^\dagger a_i$ case. This arises because the number of ways of annihilating a sample is equal to the number of samples in the histogram bin which the annihilation operator acts upon, and this number is one greater in the case where a creation operator got to act on the bin before the annihilation operator got its chance to act on the same bin.

Note that the commutation relation in Equation 3.15 extends the properties of the creation and annihilation operators independently of the states that they act upon, so that the operators now have specific effects on histograms with multiple samples in multiple bins; these extended properties were not specified in the development up as far as Equation 3.13. Thus the particular choice of commutation relation in Equation 3.15 defines a specific set of combinatoric factors for how one can select samples for creation and annihilation, which are described above and which have intuitively reasonable properties.

The above properties of the creation and annihilation operators have been justified by appealing to simple operations on the samples in histogram bins, which leads automatically to these operators having the same combinatoric properties as the creation and annihilation operators that are used in a QFT of bosons [9].
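

The creation/annihilation commutation relation can also be checked numerically on small histograms. The following is a minimal sketch under stated assumptions: states are dictionaries from occupancy tuples to integer weights, the empty dictionary stands for 0, and the helper names (`create`, `annihilate`, `subtract`, `commutator`) are hypothetical names chosen for this illustration. It applies $a_i a_j^\dagger - a_j^\dagger a_i$ to every 3-bin histogram with up to 3 samples per bin and compares the result with $\delta_{i,j}$ times the original state.

```python
from collections import defaultdict
from itertools import product

def create(state, i):
    """a_i^dagger: add one sample to bin i."""
    return {occ[:i] + (occ[i] + 1,) + occ[i + 1:]: w for occ, w in state.items()}

def annihilate(state, i):
    """a_i: remove one sample from bin i, weighted by the n_i ways of doing so."""
    return {occ[:i] + (occ[i] - 1,) + occ[i + 1:]: w * occ[i]
            for occ, w in state.items() if occ[i] > 0}

def subtract(x, y):
    """Difference of two weighted sums of histograms, dropping zero weights."""
    out = defaultdict(int, x)
    for occ, w in y.items():
        out[occ] -= w
    return {occ: w for occ, w in out.items() if w != 0}

def commutator(state, i, j):
    """(a_i a_j^dagger - a_j^dagger a_i) applied to state."""
    return subtract(annihilate(create(state, j), i),
                    create(annihilate(state, i), j))

m = 3
all_ok = all(
    commutator({occ: 1}, i, j) == ({occ: 1} if i == j else {})
    for occ in product(range(4), repeat=m)
    for i in range(m)
    for j in range(m)
)
print(all_ok)
```

The check relies only on the two combinatoric rules described above (one way to create, $n_i$ ways to annihilate), so it illustrates that those rules are sufficient to reproduce the commutator.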


4. Commutation Relations Generalise MCMC Algorithms 


In Section III B 3 a set of commutation relations was
defined based on the required properties of the creation
and annihilation operators in a variety of simple cases
that were discussed in Section III B 2. However, these
commutation relations do more than just summarise
these special cases; they extend the use of creation and
annihilation operators to all situations, including cases
where the histogram bins are occupied by an arbitrary
number of samples. Thus these commutation relations
provide an algebraically simple route to generalisation of
MCMC algorithms. No doubt there are other
generalisations of the standard MCMC algorithm, but none of
them will have the algebraic simplicity of the properties
defined in Section III B 3.

For instance, consider the multiply occupied state
$(a_1^\dagger)^{n_1} \cdots (a_m^\dagger)^{n_m} |0\rangle$. As in QFT [9], the creation
operators can be used to construct a Fock space of states
with all possible occupancies, and this Fock space can be
explored by applying creation and annihilation operators.
This type of exploration corresponds to what is done in
reversible jump MCMC algorithms [5], where the scope
of MCMC updates is extended so that they sample from
various models, in addition to the sampling within a
single model that usually occurs.

It can be seen that the effect of $\sum_{j=1}^{m} a_j$ is to count the number of samples in each histogram bin (i.e. the number of ways of annihilating a sample from a bin is equal to the number of samples in the bin), and to also annihilate one of the samples from each bin, as is shown in Equation 3.18.


$$\sum_{j=1}^{m} a_j\, (a_1^\dagger)^{n_1} \cdots (a_m^\dagger)^{n_m} |0\rangle = \sum_{j=1}^{m} n_j\, (a_1^\dagger)^{n_1} \cdots (a_j^\dagger)^{n_j - 1} \cdots (a_m^\dagger)^{n_m} |0\rangle \tag{3.18}$$


The above deficit of one sample after the application of $\sum_{j=1}^{m} a_j$ can be rectified by altering the operator as $\sum_{j=1}^{m} a_j \rightarrow \sum_{j=1}^{m} a_j^\dagger a_j$, because the inclusion of $a_j^\dagger$ to the left of $a_j$ ensures that a sample will be created in bin $j$ to make up for the one that $a_j$ annihilated. Note that there is only one way of creating a sample in a bin, but there are as many ways of annihilating a sample as there are samples in the bin.

The result in Equation 3.18 can be summarised as follows for $n \geq 1$ (note that the r.h.s. is 0 for $n = 0$)

$$a_j (a_i^\dagger)^{n} |0\rangle = n\, \delta_{i,j}\, (a_i^\dagger)^{n-1} |0\rangle \tag{3.19}$$

which can be represented for a 4-bin histogram and for $i = 3$ as

$$(0,0,0,0) \xrightarrow{(a_i^\dagger)^{n}} (0,0,n,0) \xrightarrow{a_j} \begin{cases} n\,(0,0,n-1,0) & j = i \\ 0 & j \neq i \end{cases} \tag{3.20}$$

This result may be used in general to move annihilation operators to the right of all creation operators. The result in Equation 3.19 is easily proved by using $[a_j, a_i^\dagger] = \delta_{i,j}$ to progressively move $a_j$ to the right through one $a_i^\dagger$ at a time, and then using $a_j |0\rangle = 0$ to discard any terms that contain $a_j |0\rangle$.


5. Doing Calculations with Creation and Annihilation Operators


Using explicit notation (e.g. $(0,0,0,0) \xrightarrow{a_3^\dagger} (0,0,1,0)$) for what the creation and annihilation operators are doing to the samples in the histogram bins is very tedious in cases that are not much more complicated than the ones discussed above. The purpose of introducing creation and annihilation operators is to replace the manipulation of histogram samples by algebraic manipulations based on the properties $a_i |0\rangle = 0$ and $[a_i, a_j^\dagger] = \delta_{i,j}$, which also has the desirable side effect that the calculations can be completely automated by using symbolic algebra techniques. In general, explicit notation should be needed only to verify what is being done to the samples in the histograms, and to check that this corresponds to what was intended.


From a theoretical point of view the commutation relations in Equation 3.15 are an algebraic way of doing the book-keeping to keep track of how creation and annihilation operators construct and modify histogram states depending on the order in which the operators are applied. The $[a_i, a_j^\dagger] = \delta_{i,j}$ commutation relation can be written in the form $a_i a_j^\dagger = a_j^\dagger a_i + \delta_{i,j}$, which can then be used to replace $a_i a_j^\dagger$ by $a_j^\dagger a_i + \delta_{i,j}$. This effectively moves the annihilation operator to the right (giving the $a_j^\dagger a_i$ term) whilst picking up a commutator (the $\delta_{i,j}$ term) as a side effect. This says that annihilation after creation (i.e. $a_i a_j^\dagger$) is the same as annihilation before creation (i.e. $a_j^\dagger a_i$), except for when the operators are applied to the same bin, which triggers the appearance of the $\delta_{i,j}$ term for reasons discussed above.


As a manual exercise, it can be verified that operators with the above properties (i.e. $a_i |0\rangle = 0$ and $[a_i, a_j^\dagger] = \delta_{i,j}$) correctly annihilate a sample from a 2-sample histogram (samples in any bins); this generalises Equation 3.12 to the case where the bins are not assumed to be the same. The strategy in this derivation (and in all other derivations using creation and annihilation operators) is to move the annihilation operators to the right of all the creation operators (using $a_j a_i^\dagger = a_i^\dagger a_j + \delta_{i,j}$), thus generating a sum of terms of the form $(a^\dagger a^\dagger a^\dagger a^\dagger \cdots)(a\, a\, a\, a \cdots) |0\rangle$, and wherever there is a non-zero number of annihilation operators acting on $|0\rangle$ the term may be removed (using $a_i |0\rangle = 0$). This leaves a sum of terms that contain only creation operators acting on $|0\rangle$.
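

The strategy of moving annihilation operators to the right lends itself to automation. The following Python sketch is one possible way of doing this and is not taken from the original text: an operator product is written as a tuple of pairs, with the hypothetical labels `("a", j)` for annihilation in bin `j` and `("c", i)` for creation in bin `i`, the leftmost operator acting last. The function `apply_to_vacuum` (an assumed name) repeatedly uses $a_j a_i^\dagger = a_i^\dagger a_j + \delta_{i,j}$ and then discards any term in which an annihilation operator reaches $|0\rangle$.

```python
from collections import defaultdict

def apply_to_vacuum(word):
    """Reduce word|0> to {sorted creation-only word: coefficient}."""
    result = defaultdict(int)
    pending = [(tuple(word), 1)]
    while pending:
        w, c = pending.pop()
        for k in range(len(w) - 1):
            (left_kind, j), (right_kind, i) = w[k], w[k + 1]
            if left_kind == "a" and right_kind == "c":
                # a_j c_i = c_i a_j + delta_ij: swap the pair ...
                pending.append((w[:k] + (w[k + 1], w[k]) + w[k + 2:], c))
                # ... and, for the same bin, add the term with the pair removed.
                if i == j:
                    pending.append((w[:k] + w[k + 2:], c))
                break
        else:
            # No annihilator stands to the left of a creator any more.
            # A surviving annihilator would act directly on |0> and give 0.
            if all(kind == "c" for kind, _ in w):
                result[tuple(sorted(w))] += c  # creation operators commute
    return dict(result)

def annihilate_any_then(word, m):
    """Apply sum_j a_j to word|0> for an m-bin histogram."""
    total = defaultdict(int)
    for j in range(m):
        for term, c in apply_to_vacuum((("a", j),) + tuple(word)).items():
            total[term] += c
    return dict(total)

different_bins = annihilate_any_then((("c", 0), ("c", 2)), m=4)
same_bin = annihilate_any_then((("c", 0), ("c", 0)), m=4)
print(different_bins)
print(same_bin)
```

Each rewriting step either moves an annihilation operator one place to the right or shortens the product, so the loop terminates. The two examples correspond to the 2-sample histograms with samples in different bins and in the same bin, respectively.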


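The detailed derivation of the effect of applying $\sum_{j=1}^{m} a_j$ to $a_{i_1}^\dagger a_{i_2}^\dagger |0\rangle$ is as follows

$$\begin{aligned}
\sum_{j=1}^{m} a_j\, a_{i_1}^\dagger a_{i_2}^\dagger |0\rangle &= \sum_{j=1}^{m} \big(a_j a_{i_1}^\dagger\big)\, a_{i_2}^\dagger |0\rangle \\
&= \sum_{j=1}^{m} \big(a_{i_1}^\dagger a_j + \delta_{i_1,j}\big)\, a_{i_2}^\dagger |0\rangle \\
&= \sum_{j=1}^{m} \big(a_{i_1}^\dagger \big(a_j a_{i_2}^\dagger\big) + \delta_{i_1,j}\, a_{i_2}^\dagger\big) |0\rangle \\
&= \sum_{j=1}^{m} \big(a_{i_1}^\dagger \big(a_{i_2}^\dagger a_j + \delta_{i_2,j}\big) + \delta_{i_1,j}\, a_{i_2}^\dagger\big) |0\rangle \\
&= \sum_{j=1}^{m} \big(a_{i_1}^\dagger a_{i_2}^\dagger a_j + \delta_{i_2,j}\, a_{i_1}^\dagger + \delta_{i_1,j}\, a_{i_2}^\dagger\big) |0\rangle \\
&= \sum_{j=1}^{m} \big(a_{i_1}^\dagger a_{i_2}^\dagger \big(a_j |0\rangle\big) + \delta_{i_2,j} \big(a_{i_1}^\dagger |0\rangle\big) + \delta_{i_1,j} \big(a_{i_2}^\dagger |0\rangle\big)\big) \\
&= a_{i_1}^\dagger |0\rangle + a_{i_2}^\dagger |0\rangle
\end{aligned} \tag{3.21}$$

After this sort of manipulation has been done a few times it is not necessary to write down all of the intermediate steps as above, because the manipulations have a very simple form where each annihilation operator $a_j$ is moved freely to the right, except that whenever it passes through a corresponding creation operator $a_i^\dagger$ an additional term is created (i.e. the $\delta_{i,j}$ commutator term). In more complicated cases it is more convenient to replace manual manipulations with symbolic manipulations.


6. Number Operator


The above results (e.g. see Equation 3.18) allow the definition of a number operator $N$ that counts the total number of samples in the histogram. Thus

$$N \equiv \sum_{i=1}^{m} a_i^\dagger a_i \tag{3.22}$$

This gives

$$N\, (a_1^\dagger)^{n_1} (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle = (n_1 + n_2 + \cdots + n_m)\, (a_1^\dagger)^{n_1} (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle \tag{3.23}$$

where the total number of histogram samples $n = n_1 + n_2 + \cdots + n_m$ is the quantity that is measured by applying $N$. For instance, $N (a_i^\dagger)^{n_i} |0\rangle$ can be represented for a 4-bin histogram and for $i = 3$ as

$$(0,0,0,0) \xrightarrow{(a_3^\dagger)^{n_3}} (0,0,n_3,0) \xrightarrow{N} n_3\, (0,0,n_3,0) \tag{3.24}$$

The structure of $N$ in Equation 3.22 makes it clear how to define the number operator $N_i$ for bin $i$ of the histogram, so that $N = \sum_{i=1}^{m} N_i$ where $N_i$ is defined as

$$N_i \equiv a_i^\dagger a_i \tag{3.25}$$

and $N_i (a_j^\dagger)^{n_j} |0\rangle$ may be represented for a 4-bin histogram and for $j = 3$ as

$$(0,0,0,0) \xrightarrow{(a_j^\dagger)^{n_j}} (0,0,n_j,0) \xrightarrow{N_i} \begin{cases} n_j\, (0,0,n_j,0) & i = j \\ 0 & i \neq j \end{cases} \tag{3.26}$$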
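

As a small illustration of how the number operator counts samples, the sketch below builds $N_i = a_i^\dagger a_i$ from the two elementary operations. It is an assumption-laden toy rather than part of the formalism: states are dictionaries from occupancy tuples to integer weights, the empty dictionary stands for 0, and the function names (`create`, `annihilate`, `number`, `total_number`) are introduced here only for illustration.

```python
from collections import defaultdict

def create(state, i):
    """a_i^dagger: add one sample to bin i."""
    return {occ[:i] + (occ[i] + 1,) + occ[i + 1:]: w for occ, w in state.items()}

def annihilate(state, i):
    """a_i: remove one sample from bin i, weighted by the n_i ways of doing so."""
    return {occ[:i] + (occ[i] - 1,) + occ[i + 1:]: w * occ[i]
            for occ, w in state.items() if occ[i] > 0}

def number(state, i):
    """N_i = a_i^dagger a_i: annihilate in bin i, then create in bin i."""
    return create(annihilate(state, i), i)

def total_number(state, m):
    """N = sum_i N_i for an m-bin histogram."""
    out = defaultdict(int)
    for i in range(m):
        for occ, w in number(state, i).items():
            out[occ] += w
    return dict(out)

# Histogram with occupancies (2, 0, 3, 1); bins are indexed from 0 in the code.
state = {(2, 0, 3, 1): 1}
print(number(state, 2))        # the histogram is returned weighted by n_3
print(number(state, 1))        # an empty bin gives 0, i.e. the empty dict
print(total_number(state, 4))  # the histogram is returned weighted by n
```

The histogram itself is unchanged by $N_i$ and $N$; only its weight records the count, which is how the operator "measures" the occupancy.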


7. Orthogonality and Completeness 


The states constructed using the creation operators described above are orthogonal and complete. Consider the general histogram state $(a_1^\dagger)^{n_1} (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle$ and attempt to annihilate its samples. The strategy of the proof will be to demonstrate that there is a unique set of annihilation operators that you have to use in order to recover the empty histogram state $|0\rangle$.

Apply a single annihilation operator $a_1$ (using Equation 3.19 to move it to the right)

$$a_1\, (a_1^\dagger)^{n_1} (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle = n_1\, (a_1^\dagger)^{n_1 - 1} (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle \tag{3.27}$$

Now apply the same annihilation operator $n_1 - 1$ more times to eventually obtain

$$(a_1)^{n_1} (a_1^\dagger)^{n_1} (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle = n_1!\, (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle \tag{3.28}$$

Repeat this pattern of annihilation successively for bins $2, 3, \cdots, m$ of the histogram to obtain

$$(a_m)^{n_m} \cdots (a_2)^{n_2} (a_1)^{n_1} (a_1^\dagger)^{n_1} (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle = n_1!\, n_2! \cdots n_m!\, |0\rangle \tag{3.29}$$

where the resulting state is (proportional to) the empty histogram $|0\rangle$.

Thus we recover the empty histogram by applying exactly those annihilation operators to the histogram that correspond to the creation operators that we used to construct the histogram in the first place. The fact that the empty histogram can be recovered only by applying the same set $(n_1, n_2, \cdots, n_m)$ of annihilation operators as creation operators means that the states are orthogonal, and the fact that all possible states are constructable using the appropriate set $(n_1, n_2, \cdots, n_m)$ of creation operators means that the states are complete.

The constant of proportionality $n_1!\, n_2! \cdots n_m!$ is the number of ways in which the annihilation operators can annihilate the histogram samples, which corresponds to the total number of ways of permuting the samples within the histogram bins (but not permuting between bins). If this permutation factor is not required then the states could be defined as $\frac{1}{\sqrt{n_1!\, n_2! \cdots n_m!}} (a_1^\dagger)^{n_1} (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle$, and a similar normalisation factor $\frac{1}{\sqrt{n_1!\, n_2! \cdots n_m!}}$ should be included with the annihilation operators when this whole state is to be annihilated. It is a matter of taste whether the normalisation factor is included along with the state, or whether it is not included but is then subsequently divided out from the results of calculations.


8. States and Adjoint States


The above results on orthogonality and completeness can be written more rigorously by introducing the adjoint state. Intuitively, the adjoint state is obtained by time-reversing everything, so that instead of making operators act to the right (with operators that act later being placed further to the left), the operators in an adjoint state act to the left (with operators that act earlier being placed further to the right). Note that between these two viewpoints the time order of operator action corresponds to the order in which the operators appear in the “operator product”. Also note that a creation operator acting to the right (i.e. create a sample as time increases, as in $a_i^\dagger |0\rangle$) behaves in the same way as an annihilation operator acting to the left (i.e. annihilate a sample as time decreases, as in $\langle 0| a_i^\dagger = 0$). In this case $a_i^\dagger |0\rangle$ says (reading from right to left) that there is an empty histogram in the distant past which later has a sample created in bin $i$, whereas $\langle 0| a_i^\dagger$ says (reading from left to right) that there is an empty histogram in the distant future which earlier has a sample annihilated from bin $i$ to give 0 (i.e. $\langle 0| a_i^\dagger = 0$).

Introduce a notation for a histogram with occupancies $(n_1, n_2, \cdots, n_m)$

$$\Theta_{n_1, n_2, \cdots, n_m} \equiv (a_1^\dagger)^{n_1} (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle \tag{3.30}$$

and its adjoint state for creating a histogram with occupancies $(n_1, n_2, \cdots, n_m)$, but done in the reversed time sense where there is an empty histogram in the far future, which is then populated as we move backwards in time

$$\Theta^\dagger_{n_1, n_2, \cdots, n_m} \equiv \langle 0| (a_m)^{n_m} \cdots (a_2)^{n_2} (a_1)^{n_1} \tag{3.31}$$

The orthogonality property can then be stated as

$$\Theta^\dagger_{\nu_1, \nu_2, \cdots, \nu_m} \Theta_{n_1, n_2, \cdots, n_m} = \delta_{n_1,\nu_1} \delta_{n_2,\nu_2} \cdots \delta_{n_m,\nu_m}\, n_1!\, n_2! \cdots n_m! \tag{3.32}$$

where the result in Equation 3.29 is used, and where $\langle 0| |0\rangle = 1$ is defined. The completeness property then corresponds to the following resolution of the identity operator

$$\sum_{n_1, n_2, \cdots, n_m} \frac{1}{n_1!\, n_2! \cdots n_m!}\, \Theta_{n_1, n_2, \cdots, n_m} \Theta^\dagger_{n_1, n_2, \cdots, n_m} = 1 \tag{3.33}$$

where the states that this operator acts upon are assumed to be constructed in the same way as $\Theta_{n_1, n_2, \cdots, n_m}$ (i.e. using creation operators).
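

The orthogonality property can be checked by brute force on small histograms. The sketch below is illustrative only and rests on assumptions introduced here: states are dictionaries from occupancy tuples to integer weights, the empty dictionary stands for 0, the function names `ket` and `overlap` are hypothetical, and the adjoint vacuum is implemented by reading off the weight of the empty histogram (which uses $\langle 0| |0\rangle = 1$ and the fact that the adjoint vacuum gives 0 on any non-empty histogram).

```python
from itertools import product
from math import factorial, prod

def create(state, i):
    """a_i^dagger: add one sample to bin i."""
    return {occ[:i] + (occ[i] + 1,) + occ[i + 1:]: w for occ, w in state.items()}

def annihilate(state, i):
    """a_i: remove one sample from bin i, weighted by the n_i ways of doing so."""
    return {occ[:i] + (occ[i] - 1,) + occ[i + 1:]: w * occ[i]
            for occ, w in state.items() if occ[i] > 0}

def ket(n):
    """Histogram state built by applying n_i creation operators to each bin i."""
    state = {(0,) * len(n): 1}
    for i, n_i in enumerate(n):
        for _ in range(n_i):
            state = create(state, i)
    return state

def overlap(nu, n):
    """Apply nu_i annihilation operators per bin to ket(n), then project on |0>."""
    state = ket(n)
    for i, nu_i in enumerate(nu):
        for _ in range(nu_i):
            state = annihilate(state, i)
    return state.get((0,) * len(n), 0)

occupancies = list(product(range(3), repeat=2))
orthogonal = all(
    overlap(nu, n) == (prod(factorial(k) for k in n) if nu == n else 0)
    for nu in occupancies
    for n in occupancies
)
print(orthogonal)
print(overlap((2, 1), (2, 1)))
```

Only the matching set of annihilation operators recovers the empty histogram, and it does so with the permutation factor $n_1!\, n_2! \cdots n_m!$ as its weight.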


9. Summary of Useful Results 


1. Creation operator for bin $i$: $a_i^\dagger$. When applied to a histogram state this creates one sample in bin $i$.


2. Annihilation operator for bin $i$: $a_i$. When applied to a histogram state this annihilates one sample from bin $i$ in as many ways (i.e. $n_i$) as there are samples already in bin $i$. The result is $n_i$ copies of the histogram state with one sample annihilated from bin $i$. This includes the special case $n_i = 0$ where the histogram is annihilated altogether to give 0.


3. Annihilation operator for all bins: $\sum_{i=1}^{m} a_i$. This produces a generalisation of what $a_i$ alone does. For each $i$ ($i = 1, 2, \cdots, m$) the result is $n_i$ copies of the histogram state with one sample annihilated from bin $i$, which gives a total of $\sum_{i=1}^{m} n_i$ histograms. This operator is useful for preparing a histogram for an MCMC update because it removes a sample at random from the histogram (i.e. it prepares $\sum_{i=1}^{m} n_i$ copies of the histogram in each of which a different sample has been annihilated).


4. Annihilate an empty histogram: $a_i |0\rangle = 0$. This defines the “vacuum” state as a reference state for determining the occupancy of each histogram bin. This definition is very useful for removing terms that do not contribute to the overall histogram state.


5. Creation/annihilation commutator: $[a_i, a_j^\dagger] = \delta_{i,j}$. This summarises the basic interaction between the creation and annihilation operators. It is mainly used in the form $a_i a_j^\dagger = a_j^\dagger a_i + \delta_{i,j}$ to move annihilation operators to the right of creation operators, which eventually brings the annihilation operators so that they act directly on $|0\rangle$, where they can be removed (using $a_i |0\rangle = 0$).


6. Annihilation/annihilation and creation/creation commutators: $[a_i, a_j] = 0$ and $[a_i^\dagger, a_j^\dagger] = 0$. These summarise the fact that a sequence consisting solely of annihilation operations (or solely of creation operations) has the same effect whatever the order in which the operators appear in the sequence.


7. Moving an annihilation operator to the right: $a_j (a_i^\dagger)^{n} |0\rangle = n\, \delta_{i,j}\, (a_i^\dagger)^{n-1} |0\rangle$. This is the basic result that is used to remove annihilation operators from expressions. The $a_j$ is moved progressively to the right through the $a_i^\dagger$ (using $a_j a_i^\dagger = a_i^\dagger a_j + \delta_{i,j}$) until it reaches the $|0\rangle$, where it is discarded (using $a_j |0\rangle = 0$).


8. Number operator for bin $i$: $N_i = a_i^\dagger a_i$. This annihilates then creates a sample in bin $i$. Because there are $n_i$ ways of annihilating a sample but only 1 way of creating a sample, the net effect is to count the number $n_i$ of samples in bin $i$.


9. Total number operator for all bins: $N = \sum_{i=1}^{m} a_i^\dagger a_i$. This counts the total number of samples in the histogram. This follows directly from $N_i = a_i^\dagger a_i$ above.


10. State and adjoint state: $\Theta_{n_1, n_2, \cdots, n_m} = (a_1^\dagger)^{n_1} (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle$ and $\Theta^\dagger_{n_1, n_2, \cdots, n_m} = \langle 0| (a_m)^{n_m} \cdots (a_2)^{n_2} (a_1)^{n_1}$ (respectively). The adjoint state can be applied to the left of a state and the annihilation operators then moved to the right using $a_j (a_i^\dagger)^{n} |0\rangle = n\, \delta_{i,j}\, (a_i^\dagger)^{n-1} |0\rangle$ to demonstrate orthogonality (assuming $\langle 0| |0\rangle = 1$). The adjoint of $a_i |0\rangle = 0$ implies $\langle 0| a_i^\dagger = 0$.


11. Orthogonality: $\Theta^\dagger_{\nu_1, \nu_2, \cdots, \nu_m} \Theta_{n_1, n_2, \cdots, n_m} = \delta_{n_1,\nu_1} \delta_{n_2,\nu_2} \cdots \delta_{n_m,\nu_m}\, n_1!\, n_2! \cdots n_m!$. $\langle 0| |0\rangle = 1$ is assumed by definition.


12. Completeness: All states $\Theta_{n_1, n_2, \cdots, n_m}$ are constructable by using the appropriate set $(n_1, n_2, \cdots, n_m)$ of creation operators.


10. Multiple MRF Nodes 


The above results are for a single MRF node. When there are multiple nodes, each MRF node has its own set of creation and annihilation operators, which have all of the properties described above. Operators for different nodes commute with each other because they act on different state spaces, so the generalised form of Equation 3.15 is

$$\begin{aligned}
[a_i^{s}, a_j^{t\dagger}] &= \delta_{i,j}\, \delta_{s,t} \\
[a_i^{s}, a_j^{t}] &= 0 \\
[a_i^{s\dagger}, a_j^{t\dagger}] &= 0
\end{aligned} \tag{3.34}$$

where $s$ and $t$ are node indices. There are analogous generalisations of all the results in Section III B 9.


C. MCMC Update Operator 


In Section II D it was shown how the state of an N-
node MRF can be represented as a set of N histograms
each of which contains one sample in one of the histogram
bins, and how MCMC updates of the MRF can be rep- 
resented as hopping operations where each sample hops 
around between the bins of its histogram. The aim now 
is to use the creation and annihilation operators defined 
in Section IIIB to implement these MCMC hopping op- 
erations. 

The MCMC update operator H can be constructed in 
several easy steps, in which each MCMC hopping opera- 
tion is broken down into annihilation followed by subse- 
quent creation of a sample. 


1. Annihilate a sample (see the middle row of Figure 1). Apply $\sum_{i=1}^{m} a_i$ to the histogram state to annihilate one sample from each bin, which prepares $\sum_{i=1}^{m} n_i$ copies of the histogram in each of which a different sample has been annihilated. The output of this operation is thus a linear combination of histogram states, where each state is weighted by the same factor of unity (i.e. all states are equally likely). This linear combination of $\sum_{i=1}^{m} n_i$ terms (of which only $m$ are distinct) represents the ensemble of all the possible outcomes of annihilating one sample.


2. Create a sample (see the bottom row of Figure 1). Apply $\sum_{i=1}^{m} p_i a_i^\dagger$ to each histogram state in the ensemble generated above, which prepares $m$ copies of the histogram in each of which a different sample has been created, and weight each of these $m$ histogram states so that where the sample is created in bin $i$ the state is weighted by a factor $p_i$. If the $p_i$ satisfy $p_i \geq 0$ and $\sum_{i=1}^{m} p_i = 1$ then $p_i$ can be interpreted as the probability of creating a sample in bin $i$. Actually, the normalisation condition $\sum_{i=1}^{m} p_i = 1$ can be omitted because the relative size of the $p_i$ is all that is required. The output of this operation is thus a linear combination of histogram states, where each state is weighted by the appropriate probability factor $p_i$ corresponding to the bin $i$ in which a sample has just been created. This linear combination of $m$ terms represents the ensemble of all the possible outcomes of creating one sample in one of the bins of a histogram.


Concatenate these two operators to define the MCMC update operator $H$

$$H \equiv \Big(\sum_{i=1}^{m} p_i a_i^\dagger\Big) \Big(\sum_{j=1}^{m} a_j\Big) = \sum_{i,j=1}^{m} p_i\, a_i^\dagger a_j \tag{3.35}$$

where the action of $\sum_{j=1}^{m} a_j$ produces $\sum_{i=1}^{m} n_i$ histograms, then the action of $\sum_{i=1}^{m} p_i a_i^\dagger$ on each of these $\sum_{i=1}^{m} n_i$ histograms produces $m$ histograms. Finally, all of these histograms should be regrouped so that multiple copies of identical histograms are represented as a single copy with an appropriate weighting factor.
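
The two-step construction of the update operator can be illustrated with a short Python sketch. This is a toy rendering under stated assumptions rather than a definitive implementation: a state is a dictionary from occupancy tuples to weights, the creation probabilities are held fixed in a list `p` (in an MRF they would depend on neighbouring nodes), exact fractions are used so that the weights can be compared exactly, and the function name `mcmc_update` is hypothetical. Bins are indexed from 0.

```python
from fractions import Fraction

def create(state, i):
    """a_i^dagger: add one sample to bin i."""
    return {occ[:i] + (occ[i] + 1,) + occ[i + 1:]: w for occ, w in state.items()}

def annihilate(state, i):
    """a_i: remove one sample from bin i, weighted by the n_i ways of doing so."""
    return {occ[:i] + (occ[i] - 1,) + occ[i + 1:]: w * occ[i]
            for occ, w in state.items() if occ[i] > 0}

def mcmc_update(state, p):
    """Apply (sum_i p_i a_i^dagger)(sum_j a_j) and regroup identical histograms."""
    m = len(p)
    out = {}
    for j in range(m):
        removed = annihilate(state, j)          # step 1: annihilate a sample
        for i in range(m):
            for occ, w in create(removed, i).items():   # step 2: create a sample
                out[occ] = out.get(occ, 0) + p[i] * w   # weight by p_i, regroup
    return out

p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]

single = mcmc_update({(0, 0, 1, 0): 1}, p)   # one sample, initially in bin 3
double = mcmc_update({(1, 0, 1, 0): 1}, p)   # two samples, in bins 1 and 3

print(single)
print(double)
```

For a singly occupied histogram the output weights are just the $p_i$, independent of where the sample started, which reflects the memoryless character of the update. For a histogram with $n$ samples the weights sum to $n$ when the $p_i$ are normalised, because each of the $n$ samples can be the one that hops.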

The weighting factor that is applied to the state (as
used here) represents probability itself rather than
probability amplitude (as used in the corresponding QFT).
However, if a QFT is “Wick rotated” to become a
Euclidean QFT then it is equivalent to quantum statistical
mechanics [9], where the state is a probability-weighted 
mixture of states. So the approach discussed in this pa- 
per has a mathematical structure that is similar to the 
Euclidean version of a QFT of bosons. 

The pieces $p_i a_i^\dagger a_j$ of the MCMC update operator may be represented diagrammatically as

[Diagram: state $j$ enters from the left and is annihilated by $a_j$; state $i$ is created by $a_i^\dagger$ and leaves to the right; the weight $p_i$ is supplied by a source.]

where state $j$ comes in from the left and is annihilated by $a_j$, and a new state $i$ is created by $a_i^\dagger$ which then goes out to the right, and the probability of this transition occurring is $p_i$ which depends only on the output state (so it is memoryless), which is in turn generated by a source (e.g. MRF neighbours, external source, etc). The whole MCMC update operator $H$ is the sum of this diagram over states $i$ and $j$.

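This result can be generalised to an MRF with $N$ nodes (with node $s$ having $m_s$ states)

$$H = \sum_{s=1}^{N} \sum_{i=1}^{m_s} \sum_{j=1}^{m_s} p_i^s \, a_i^{s\dagger} a_j^s \qquad (3.36)$$

which can be written using the transition operator $T_{i,j}^s = a_i^{s\dagger} a_j^s$ that hops a sample from bin $j$ to bin $i$ at node $s$.

$$H = \sum_{s=1}^{N} \sum_{i,j=1}^{m_s} p_i^s \, T_{i,j}^s \qquad (3.37)$$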


In practice the creation probability $p_i^s$ depends (via a product of clique factors, as described in the discussion on the HCE in Section IIA) on the states of the other nodes in the MRF. This probability can be computed by applying an appropriately designed operator to the MRF node states. Thus use the number operator for bin $k$ at node $t$ (which is $N_k^t = a_k^{t\dagger} a_k^t$) weighted by $p_{i,k}^{s,t}$ to determine the 2-clique contribution (i.e. pairwise interactions between nodes of the MRF) for creation in bin $i$ at node $s$ due to bin $k$ at node $t$ being occupied. This operator expression is appropriate for any number of samples in bin $k$ at node $t$, because the number operator $N_k^t$ automatically determines the number of samples as needed, and then uses this number to weight any clique factor that involves this node.

This use of sample number to weight clique factors is 
consistent because it guarantees that a single sample at 
each node (i.e. standard HCE) is physically equivalent 
to the situation where each of these samples is cut into 
a number of equal-sized sub-samples, because the addi- 
tional factors then generated by the number operator ap- 
plied to these sub-samples are exactly cancelled by the 
additional factors then generated by the fact that inter- 
actions between sub-samples are proportionally weaker 
than interactions between samples. 






This allows $p_i^s$ to be replaced by an operator $P_i^s$, which can be used to construct a $p_i^s$ based on whatever samples it finds in the histograms in the neighbourhood $C(s)$ of node $s$ of the MRF.

$$p_i^s \longrightarrow P_i^s = \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} \, N_k^t \qquad (3.38)$$
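As an illustration of how the number operators weight the 2-clique factors, the following sketch evaluates the value that the operator $P_i^s$ takes on a state with definite occupancies, i.e. it replaces each $N_k^t$ by the sample count $n_k^t$. The data layout (a dictionary `occupancy` of per-node occupancy tuples, a dictionary `clique` of factor tables indexed as `clique[(s, t)][i][k]`) and the function name `creation_weight` are assumptions made for this example only.

```python
def creation_weight(i, s, occupancy, neighbours, clique):
    """Value of P_i^s on a state with definite occupancies.

    Computes prod over t in C(s) of sum_k p^{s,t}_{i,k} * n^t_k, where n^t_k
    is the number of samples in bin k at node t.
    """
    weight = 1.0
    for t in neighbours[s]:
        weight *= sum(
            clique[(s, t)][i][k] * n_k for k, n_k in enumerate(occupancy[t])
        )
    return weight


# Illustrative 2-node example: node 1 holds two samples in its bin 1.
occupancy = {0: (1, 0), 1: (0, 2)}
neighbours = {0: [1]}
clique = {(0, 1): [[0.9, 0.1], [0.2, 0.8]]}

w0 = creation_weight(0, 0, occupancy, neighbours, clique)
w1 = creation_weight(1, 0, occupancy, neighbours, clique)
print(w0, w1)
```

With a single sample per node the inner sum picks out one clique factor; with several samples the factor is weighted by the sample count, as described in the text.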


This result should be compared with the product form of the HCE in Equation 2.1, where the $\prod_{t \in C(s)} (\cdots)$ in Equation 3.38 corresponds to the $\prod_c (\cdots)$ in Equation 2.1, and the sum over operators $\sum_{k=1}^{m_t} (\cdots)$ in Equation 3.38 is needed to cover all the possibilities that might appear in the $(\cdots)$ inside $\prod_c (\cdots)$ in Equation 2.1. More generally for 3-cliques the operator $P_i^s$ is given by

$$p_i^s \longrightarrow P_i^s = \prod_{t_1, t_2 \in C(s)} \sum_{k_1=1}^{m_{t_1}} \sum_{k_2=1}^{m_{t_2}} p_{i,k_1,k_2}^{s,t_1,t_2} \, N_{k_1}^{t_1} N_{k_2}^{t_2} \qquad (3.39)$$

which may be straightforwardly generalised to higher order cliques.

Inserting the operator-valued version of $p_i^s$ into Equation 3.37, the MCMC update operator $H$ becomes (using 2-cliques only)

$$H = \sum_{s=1}^{N} \sum_{i,j=1}^{m_s} T_{i,j}^s \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} \, N_k^t \qquad (3.40)$$

with analogous expressions for higher order cliques. This operator-valued object $H$ can be applied to any MRF state, whether it is a conventional single sample per node state, or has multiple samples per node. This is the key advantage of using operators, because they are effectively general procedures (e.g. algorithms) that can be applied to any state that is constructed using creation operators. The algebra of the creation and annihilation operators provides a unified framework for handling all of these possibilities consistently.

The functional form used in Equation 3.40 is enforced by backward compatibility with the MCMC update operator for an MRF shown in Equation 3.36, where the factor $p_i^s$ is a product of clique factors that intersect with node $s$ (i.e. for 2-cliques only, it is generated by the $\prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t$ factor in Equation 3.40). However, the framework developed here allows for any functional form built out of creation and annihilation operators, so a very large class of update operators $H$ can be constructed such as:


1. The operator that generates the product of clique factors $\prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t$ can be replaced by some other functional form, such as a non-linear sigmoid squashing function $\sigma\left( \sum_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t \right)$, as is typically done in “neural network” implementations of recurrent networks. One possible way of viewing the relationship between this non-linear sigmoidal version and the clique product can be obtained by perturbatively expanding the sigmoid to obtain various powers of its argument $\sum_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t$, which includes terms that look like the original clique product $\prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t$, plus other higher order terms.


2. The hopping operator $T_{i,j}^s = a_i^{s\dagger} a_j^s$ can be replaced by some other functional form, such as one that increases (i.e. birth) or decreases (i.e. death) the number of samples, which may be used to allow the update operator $H$ to explore histogram states with various occupancies. Note that if this part of the overall update operator $H$ is used alone as the update operator (i.e. without the clique factor piece above), then it can be used to generate the prior behaviour that the histogram state has before any interactions with other histograms are included.


The effect of the creation and annihilation operators 
can be viewed in terms of elementary operations on his- 
tograms (as described in Section IIIB), and their opera- 
tor algebra can be used to do calculations in which H is 
applied to multiply occupied states to generate MCMC 
updates. It is also possible to use symbolic algebra to do 
these operator manipulations automatically. In general, 
the effect of the MCMC update operator H on a set of 
histogram states can be represented as a type of Feyn- 
man diagram, in which each vertex represents a product 
of operators acting on an incoming state to produce an 
outgoing state (if any), and a (weighted) sum of such 
diagrams represents the corresponding (weighted) sum 
of products of operators (note that here the weights are 
probabilities rather than probability amplitudes). 


Note that the MCMC update operator $H$ in Equation 3.40 is number-conserving in the sense that its transition operator $T_{i,j}^s = a_i^{s\dagger} a_j^s$ causes samples to hop from bin $j$ to bin $i$ at node $s$, without gain or loss of the total number of samples at node $s$. Formally, this property may be written as $[H, \mathcal{N}^u] = 0$ where $\mathcal{N}^u = \sum_{k=1}^{m_u} N_k^u$ is the total number operator at node $u$ (which may be any node of the MRF). This result can be seen intuitively because it may be written as $H \mathcal{N}^u = \mathcal{N}^u H$, which states that when you measure the total number of samples at node $u$ then do an MCMC update, you get the same result as when you do an MCMC update then measure the total number of samples at node $u$, so there must be number conservation.
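The number conservation property can be checked numerically in a small case. The sketch below is restricted to a single node with fixed creation probabilities (an assumption made to keep the example short), stores a mixed state as a dictionary from occupancy tuples to weights, and compares "measure the total number, then update" with "update, then measure the total number". The helper names `hop`, `update` and `total_number` are hypothetical.

```python
from collections import defaultdict
from math import isclose


def hop(state, i, j):
    """Apply T_ij = a_i^dagger a_j: move one sample from bin j to bin i."""
    out = defaultdict(float)
    for hist, weight in state.items():
        if hist[j] == 0:
            continue  # nothing to annihilate in bin j
        new = list(hist)
        new[j] -= 1
        new[i] += 1
        out[tuple(new)] += weight * hist[j]
    return dict(out)


def update(state, p):
    """Apply H = sum_{i,j} p_i T_ij for a single node."""
    out = defaultdict(float)
    for i in range(len(p)):
        for j in range(len(p)):
            for hist, weight in hop(state, i, j).items():
                out[hist] += p[i] * weight
    return dict(out)


def total_number(state):
    """Apply the total number operator: weight each histogram by its count."""
    return {hist: weight * sum(hist) for hist, weight in state.items()}


# Illustrative mixed state containing a 3-sample and a 2-sample histogram.
state = {(2, 1, 0): 1.0, (0, 1, 1): 0.5}
p = (0.5, 0.25, 0.25)

hn = update(total_number(state), p)   # measure, then update
nh = total_number(update(state, p))   # update, then measure
print(all(isclose(hn[h], nh[h]) for h in hn))
```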


The steps in the derivation of the number conservation property $[H, \mathcal{N}^u] = 0$ are as follows

$$
\begin{aligned}
[H, \mathcal{N}^u] &= \sum_{s=1}^{N} \sum_{i,j=1}^{m_s} \left[ T_{i,j}^s \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t , \; \mathcal{N}^u \right] \\
&= \sum_{s=1}^{N} \sum_{i,j=1}^{m_s} \left( T_{i,j}^s \left( \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t \right) \mathcal{N}^u - \mathcal{N}^u \, T_{i,j}^s \left( \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t \right) \right) \\
&= \sum_{s=1}^{N} \sum_{i,j=1}^{m_s} \left( T_{i,j}^s \left( \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t \right) \mathcal{N}^u - T_{i,j}^s \, \mathcal{N}^u \left( \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t \right) \right) \\
&= \sum_{s=1}^{N} \sum_{i,j=1}^{m_s} \left( T_{i,j}^s \left( \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t \right) \mathcal{N}^u - T_{i,j}^s \left( \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t \right) \mathcal{N}^u \right) \\
&= 0
\end{aligned}
\qquad (3.41)
$$

using $[\mathcal{N}^u, T_{i,j}^s] = 0$ ($T_{i,j}^s$ causes hopping at node $s$ but conserves total number at node $s$, and also trivially conserves total number at all other nodes) to make the replacement $\mathcal{N}^u T_{i,j}^s \longrightarrow T_{i,j}^s \mathcal{N}^u$, and $[\mathcal{N}^u, N_k^t] = 0$ (number operators always commute) to make the replacement $\mathcal{N}^u \left( \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t \right) \longrightarrow \left( \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t \right) \mathcal{N}^u$. Note that the facts that $[\mathcal{N}^u, T_{i,j}^s] = 0$ and $[\mathcal{N}^u, N_k^t] = 0$ are simple to derive from the basic creation/annihilation content of the various operators.

The overall effect of using creation and annihilation operators is to formalise the act of manipulating samples in histograms, so that these manipulations are now represented algebraically. One could avoid the use of this algebraic approach (especially when each histogram has only a single sample, as in a standard MRF), but as the manipulations become more complicated (e.g. subtle interdependencies between histograms) it is better to do them by using this algebraic approach.


D. Diagrammatic Representation of MCMC Algorithms


A sequence of MCMC updates (e.g. see Section IIB) in which $x$ and $y$ are alternately updated by sampling from $\Pr(x, y)$ is illustrated below where each arrow represents a dependency. The graph structure shows that the updates are memoryless. For instance, $x_2$ depends on $y_1$ via $\Pr(x_2|y_1)$, but it does not depend on $x_1$.


[Diagram: the alternating sequence of updates starting from $(x_1, y_1)$, with arrows showing the dependencies and labelled in order by $\Pr(x_2|y_1)$, $\Pr(y_2|x_2)$, $\Pr(x_3|y_2)$, $\Pr(y_3|x_3)$.]
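The alternating, memoryless updates can be sketched in a few lines. The example below assumes a small discrete joint distribution stored as a nested list `joint[x][y]` (unnormalised weights are sufficient), and the names `sample_conditional` and `alternate_updates` are hypothetical. Each new $x$ is drawn using only the current $y$, and each new $y$ using only the new $x$, so the previous value of the variable being updated is never consulted.

```python
import random


def sample_conditional(joint, given, value, rng):
    """Draw x from Pr(x|y=value) if given == "y", else y from Pr(y|x=value)."""
    if given == "y":
        weights = [joint[x][value] for x in range(len(joint))]
    else:
        weights = list(joint[value])
    return rng.choices(range(len(weights)), weights=weights)[0]


def alternate_updates(joint, y_start, steps, rng):
    """Alternately update x then y; returns the visited (x, y) pairs."""
    chain = []
    y = y_start
    for _ in range(steps):
        x = sample_conditional(joint, "y", y, rng)  # depends on y only
        y = sample_conditional(joint, "x", x, rng)  # depends on the new x only
        chain.append((x, y))
    return chain


# Illustrative joint distribution over two binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
chain = alternate_updates(joint, y_start=0, steps=5, rng=random.Random(0))
print(chain)
```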


The above diagram can be skeletonised by omitting all inessential labelling in order to emphasise the information flow, in which case the result looks like this

[Diagram: the same sequence of updates drawn as unlabelled nodes and arrows.]


If this skeletonisation is used to draw an information flow diagram for a sequence of MCMC updates of a 4 node Markov chain, then a typical result looks like the diagram below.

[Diagram: information flow for a random sequence of MCMC updates of a 4 node Markov chain, with the labels ±1, ±2, ±3 at the left hand side and the update applied at each time step labelled along the bottom.]

For illustrative purposes the Markov chain is drawn in the up-down direction in the diagram, with the horizontal direction being used for the discrete time steps that are generated by the MCMC update procedure. The ±n notation at the left hand side shows the labelling convention that is used for the update that occurs at each time step, where +n indicates an interaction between a node and its right hand neighbour (right is “down” in the diagram), and −n is the analogous notation for the left hand neighbour. The ±n notation along the bottom of the diagram shows the actual update interaction that occurs at each time step. The particular sequence of MCMC updates that is represented in the diagram above is unimportant because it is random.

There are 6 separate basic diagrams that are used to build the above diagram which are shown in the diagram below. Usually a randomly selected sequence of these diagrams forms the MCMC algorithm, but other choices are possible.

[Diagram: the six basic update diagrams, labelled +1, −1, +2, −2, +3, −3.]

The skeletonised structure of the diagrams can now be simplified further to make it look more symmetrical as shown in the diagram below, where the pieces of the above diagrams are drawn individually in more symmetrical fashion.

[Diagram: the basic pieces redrawn in symmetrical form.]

This reduces the description of the MCMC algorithm to a set of basic diagrams in which the state of a node evolves freely (i.e. $\rightarrow \cdot \rightarrow$) or is involved in an interaction (i.e. [the two interaction diagrams]). These diagrams allow for the possibility that a node has a “memory” of its previous state (i.e. an arrow comes in from the left), so the MCMC diagrams above are a special case in which this memory is discarded.


These diagrams can be used to represent higher order MCMC algorithms which amalgamate the effect of several basic MCMC updates. Thus, start by defining an MCMC update operator $H$. For a pair of MRF nodes this is illustrated in Equation 3.42, which is of the form $H = I + H_1 + H_2$. The $I$ is the “identity” which corresponds to no update occurring, and the $H_1$ and $H_2$ pieces correspond to updates that occur on one or the other of the two nodes, respectively.

$$H = I + H_1 + H_2 \qquad (3.42)$$

[Diagram: each of the three terms in Equation 3.42 drawn as a two-node diagram.]


Multiple MCMC updates may then be generated by iterating $H$ to create powers of $H$. For instance, $H^2$ may be derived by expanding out $(I + H_1 + H_2)^2$ and collecting together similar terms, as shown in Equation 3.43 and Equation 3.44.

$$H^2 = A_0 + A_1 + A_2 \qquad (3.43)$$

where

$$
\begin{aligned}
A_0 &= I \, I \\
A_1 &= I H_1 + H_1 I + I H_2 + H_2 I \\
A_2 &= H_1 H_1 + H_1 H_2 + H_2 H_1 + H_2 H_2
\end{aligned}
\qquad (3.44)
$$

[Diagram: each term in Equation 3.44 drawn as a two-node, two-step diagram.]

The result in Equation 3.43 and Equation 3.44 may be simplified to Equation 3.45 and Equation 3.46 (using $I^2 = I$ and $I H_i = H_i I = H_i$).

$$H^2 = B_0 + B_1 + A_2 \qquad (3.45)$$

where

$$
\begin{aligned}
B_0 &= I \\
B_1 &= 2 H_1 + 2 H_2
\end{aligned}
\qquad (3.46)
$$

[Diagram: each term in Equation 3.46 drawn as a two-node diagram.]

In the diagrammatic expression for $H^2$ in Equation 3.46 the first row represents no interaction, the second row one interaction, and the third row two interactions. Note that the order in which the interactions occur is important (i.e. $H_1 H_2 \neq H_2 H_1$ in general) so the diagrams in the third row cannot be combined. On the other hand $I H_i = H_i I = H_i$ so the diagrams in the second row can be combined.

These diagrams are actually Feynman diagrams, which describe operator expressions in a visually appealing way. In this case they show how the various operations invoked by the pieces of the MCMC update operator $H$ fit together in various ways to generate the diagrammatic representation of the higher order MCMC update operator $H^2$. This example is simple enough that the results are obvious, but the diagrammatic technique generalises to arbitrarily complicated cases.
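The bookkeeping in the expansion of $H^2$ can be checked with any pair of non-commuting matrices standing in for $H_1$ and $H_2$. The $2 \times 2$ matrices below are arbitrary stand-ins chosen for this illustration (they are not the MRF update operators themselves), and the helpers `matmul`, `add` and `scale` are hypothetical names for plain list-of-lists matrix arithmetic.

```python
def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def add(*matrices):
    """Entrywise sum of any number of equally sized matrices."""
    n = len(matrices[0])
    return [[sum(m[r][c] for m in matrices) for c in range(n)]
            for r in range(n)]


def scale(factor, a):
    """Multiply every entry of a matrix by a scalar."""
    return [[factor * x for x in row] for row in a]


identity = [[1, 0], [0, 1]]
h1 = [[0, 1], [0, 0]]   # arbitrary stand-in for H_1
h2 = [[0, 0], [1, 0]]   # arbitrary stand-in for H_2

h = add(identity, h1, h2)
lhs = matmul(h, h)

# B_0 + B_1 + A_2, keeping H_1 H_2 and H_2 H_1 as separate terms.
rhs = add(identity, scale(2, h1), scale(2, h2),
          matmul(h1, h1), matmul(h1, h2), matmul(h2, h1), matmul(h2, h2))

print(lhs == rhs)
print(matmul(h1, h2) != matmul(h2, h1))
```

The second check shows why the two-interaction terms cannot be merged: the two orderings give different matrices.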


IV. APPLICATIONS OF THE MCMC UPDATE 
OPERATOR 


The aim of this section is to show some simple practical 
uses of the operator approach that is described in Section 
III. No attempt will be made to do extensive computa- 
tions, because these will be presented in future papers in 
this “discrete network dynamics” series of papers. 

Section IV A illustrates how the MCMC update oper- 
ator correctly generates MCMC updates for histograms 
that are each occupied by a single sample, thus ensuring 
backwards compatibility between the operator approach 
and the standard MCMC algorithm for sampling MRFs. 
Section IV B generalises this to the case of multiply occu- 
pied states, and derives the equilibrium state of a single 
node MRF which has the same properties as ACEnet [6]. 


A. Update of Single-Sample States 


As a check on the result for $H$ in Equation 3.40, verify that the application of $H$ to a standard MRF state (i.e. one sample per node) leads to the expected standard form of the MCMC update.

In a standard MRF only a single bin $i_u$ is occupied at each node $u$. For an $N$-node MRF this defines a pure state $\Psi(i_1, i_2, \cdots, i_N)$ that has the form

$$\Psi(i_1, i_2, \cdots, i_N) = \left( \prod_{u=1}^{N} a_{i_u}^{u\dagger} \right) |0\rangle \qquad (4.1)$$

The first operator to consider in Equation 3.40 is the number operator $N_k^t$ (for measuring how many samples are in bin $k$ at node $t$). When $N_k^t$ is applied to $\Psi(i_1, i_2, \cdots, i_N)$ it gives

$$
\begin{aligned}
N_k^t \, \Psi(i_1, i_2, \cdots, i_N) &= N_k^t \left( \prod_{u=1}^{N} a_{i_u}^{u\dagger} \right) |0\rangle \\
&= \delta_{i_t,k} \, \Psi(i_1, i_2, \cdots, i_N)
\end{aligned}
\qquad (4.2)
$$

so the number $\delta_{i_t,k}$ is 1 if the bin at node $t$ being examined (i.e. $k$) matches the bin in which the sample at node $t$ is to be found (i.e. $i_t$), and is 0 otherwise.

Insert this result into the $\prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t$ part of $H$ in Equation 3.40 to obtain the following simplification

$$
\begin{aligned}
\prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t \left( \prod_{u=1}^{N} a_{i_u}^{u\dagger} \right) |0\rangle &= \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} \, \delta_{i_t,k} \, \Psi(i_1, i_2, \cdots, i_N) \\
&= \prod_{t \in C(s)} p_{i,i_t}^{s,t} \, \Psi(i_1, i_2, \cdots, i_N)
\end{aligned}
\qquad (4.3)
$$

which is equal to $\Psi(i_1, i_2, \cdots, i_N)$ weighted by the product of the 2-clique factors that involve node $s$. This result correctly computes the 2-clique influence of the neighbours of node $s$ that is expected in a standard MCMC algorithm.


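$H$ in Equation 3.40 also involves the transition operator $T_{i,j}^s$. Apply $T_{i,j}^s$ to $\Psi(i_1, i_2, \cdots, i_N)$ to obtain

$$
\begin{aligned}
T_{i,j}^s \, \Psi(i_1, i_2, \cdots, i_N) &= T_{i,j}^s \left( \prod_{u=1}^{N} a_{i_u}^{u\dagger} \right) |0\rangle \\
&= a_i^{s\dagger} a_j^s \; a_{i_1}^{1\dagger} \cdots a_{i_N}^{N\dagger} |0\rangle \\
&= a_{i_1}^{1\dagger} \cdots a_{i_{s-1}}^{s-1\,\dagger} \; a_i^{s\dagger} \left( a_{i_s}^{s\dagger} a_j^s + \delta_{i_s,j} \right) a_{i_{s+1}}^{s+1\,\dagger} \cdots a_{i_N}^{N\dagger} |0\rangle \\
&= \delta_{i_s,j} \; a_{i_1}^{1\dagger} \cdots a_{i_{s-1}}^{s-1\,\dagger} \; a_i^{s\dagger} \; a_{i_{s+1}}^{s+1\,\dagger} \cdots a_{i_N}^{N\dagger} |0\rangle
\end{aligned}
\qquad (4.4)
$$

where the annihilation operator $a_j^s$ is moved to the right, picking up a non-zero commutator only when it moves past the creation operator $a_{i_s}^{s\dagger}$ (i.e. both the creation and the annihilation are at the same node so they do not commute if $i_s = j$), and finally meets the empty state $|0\rangle$ which it annihilates. This result is equal to $\Psi(i_1, i_2, \cdots, i_{s-1}, i, i_{s+1}, \cdots, i_N)$ weighted by a factor $\delta_{i_s,j}$, which corresponds to a new pure state in which the sample at node $s$ has hopped to bin $i$, weighted by 1 if the sample at node $s$ started off in bin $j$, and 0 otherwise. This is exactly the behaviour that is expected of the transition operator $T_{i,j}^s$.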

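Finally, inserting the results in Equation 4.3 and Equation 4.4 into $H$ in Equation 3.40 gives

$$
\begin{aligned}
H \, \Psi(i_1, i_2, \cdots, i_N) &= \sum_{s=1}^{N} \sum_{i,j=1}^{m_s} T_{i,j}^s \prod_{t \in C(s)} \sum_{k=1}^{m_t} p_{i,k}^{s,t} N_k^t \, \Psi(i_1, i_2, \cdots, i_N) \\
&= \sum_{s=1}^{N} \sum_{i,j=1}^{m_s} \delta_{i_s,j} \left( \prod_{t \in C(s)} p_{i,i_t}^{s,t} \right) \Psi(i_1, i_2, \cdots, i_{s-1}, i, i_{s+1}, \cdots, i_N) \\
&= \sum_{s=1}^{N} \sum_{i=1}^{m_s} \left( \prod_{t \in C(s)} p_{i,i_t}^{s,t} \right) \Psi(i_1, i_2, \cdots, i_{s-1}, i, i_{s+1}, \cdots, i_N)
\end{aligned}
\qquad (4.5)
$$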




The action of $H$ on the pure state $\Psi(i_1, i_2, \cdots, i_N)$ produces a weighted sum of states (or mixed state), because the effect of $H$ at each node $s$ is to simultaneously create $m_s$ states $\Psi(i_1, i_2, \cdots, i_{s-1}, i, i_{s+1}, \cdots, i_N)$ (for $i = 1, 2, \cdots, m_s$), each of which has its own probability factor $\prod_{t \in C(s)} p_{i,i_t}^{s,t}$ (i.e. product of 2-clique factors), which is a total of $m_1 + m_2 + \cdots + m_N$ states (one for each term of the double sum in Equation 4.5) with their corresponding probability factors. Note that this ensemble of histograms should be regrouped so that multiple copies of identical histograms are represented as a single copy with an appropriate weighting factor. Thus $H \, \Psi(i_1, i_2, \cdots, i_N)$ is precisely the ensemble of states from which the standard MCMC update algorithm draws its updated state.
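The single-sample result in Equation 4.5 can be sketched directly: for every node $s$ and every candidate bin $i$, weight the state in which node $s$ has been moved to bin $i$ by the product of the 2-clique factors linking $s$ to its neighbours. The data layout (a tuple of occupied bins, a dictionary `clique` indexed as `clique[(s, t)][i][k]`) and the function name `update_ensemble` are assumptions made for this illustration, and the numerical clique factors are arbitrary.

```python
from collections import defaultdict


def update_ensemble(state, neighbours, clique, num_bins):
    """Weighted ensemble of single-sample states produced by one update.

    state[u] is the occupied bin at node u. Each term (s, i) contributes the
    state with node s moved to bin i, weighted by the product over t in C(s)
    of clique[(s, t)][i][state[t]]; identical states are regrouped.
    """
    ensemble = defaultdict(float)
    for s in range(len(state)):
        for i in range(num_bins[s]):
            weight = 1.0
            for t in neighbours[s]:
                weight *= clique[(s, t)][i][state[t]]
            new_state = state[:s] + (i,) + state[s + 1:]
            ensemble[new_state] += weight
    return dict(ensemble)


# Illustrative 2-node MRF with 2 bins per node.
neighbours = {0: [1], 1: [0]}
clique = {
    (0, 1): [[2.0, 1.0], [1.0, 2.0]],
    (1, 0): [[2.0, 1.0], [1.0, 2.0]],
}
ensemble = update_ensemble((0, 1), neighbours, clique, num_bins=[2, 2])
print(ensemble)
```

In this example the unchanged state $(0, 1)$ is produced once by each node, and the two copies are merged into a single entry.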

This verifies that the update operator $H$ generates the correct behaviour when only a single bin $i_u$ is occupied at each node $u$, as is the case in standard MCMC simulations of MRFs. Similarly, higher order cliques produce the same consistency between what the update operator $H$ generates and what the standard MCMC algorithm generates, so the assumed operator form of $H$ is backwardly compatible with MCMC simulations of standard MRFs with a single sample per node.

Standard MCMC algorithms randomly select a single 
state from the above ensemble of states generated by the 
action of the update operator H; the probability of a 
particular state being selected is given by the probabil- 
ity factor that weights that state in the ensemble. More 
sophisticated MCMC algorithms, known as particle fil- 
tering algorithms [3], select several states from the en- 
semble which allows several alternative updates to be si- 
multaneously followed, which allows the probability over 
alternatives to be represented in a sampled form. How- 
ever, all of these approaches fit into the same theoretical 
framework where the update operator H generates the 
full ensemble of alternatives. 
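The two selection strategies can be illustrated with a short sketch that draws from a weighted ensemble of the kind produced by the update operator. The ensemble values below are illustrative, the function name `draw_states` is hypothetical, and drawing several states independently with replacement is an assumption made here for simplicity rather than a description of any particular particle filtering algorithm.

```python
import random


def draw_states(ensemble, count, rng):
    """Draw `count` states with probability proportional to their weights."""
    states = list(ensemble)
    weights = [ensemble[state] for state in states]
    return rng.choices(states, weights=weights, k=count)


# Illustrative weighted ensemble of joint node states.
ensemble = {(0, 1): 2.0, (1, 1): 2.0, (0, 0): 2.0}
rng = random.Random(1)

single = draw_states(ensemble, 1, rng)[0]   # standard MCMC: one state
particles = draw_states(ensemble, 4, rng)   # several alternatives at once
print(single, particles)
```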

Note that pure states and mixed states are related to 
doubly distributional population codes [7]. Thus a pure 
state specifies a single joint state of the MRF nodes, 
whereas a mixed state specifies a range of alternative 
joint states of the MRF nodes. The operator algebra pre- 
sented in this paper provides a complete and consistent 
framework for using MCMC algorithms to manipulate 
these pure and mixed MRF states, or equivalently the 
corresponding doubly distributional population codes. 


B. Equilibrium Multi-Sample State 


The aim of this section is to demonstrate in detail that the MCMC update operator $H = \left( \sum_{i=1}^{m} p_i a_i^\dagger \right) \left( \sum_{j=1}^{m} a_j \right)$ has an equilibrium state which has the same properties as ACEnet [6].

In Section IV A the application of $H$ to a pure state $\Psi(i_1, i_2, \cdots, i_N)$ converts it into a mixed state (see Equation 4.5). The aim now is to derive the equilibrium mixed state that self-consistently maps to itself under the action of $H$. This would correspond to a mixed state that contains exactly the right mixture of pure states to balance the hopping rates generated by $H$. In physics this is known as the detailed balance condition. When there is a single sample per node this equilibrium mixed state corresponds to the equilibrium ensemble that the standard MCMC update algorithm seeks to generate.

It is not possible in general to analytically derive this 
equilibrium mixed state; if it were then MCMC algo- 
rithms would not be needed. This intractability arises 
because the clique factors cause the samples at neigh- 
bouring nodes (i.e. nodes in the same clique) to interact 
with each other, which leads to the development of in- 
direct long-range correlations between nodes by cascad- 
ing together multiple direct short-range interactions (i.e. 
paths of influence are built out of interlinked clique fac- 
tors). The summation over all possible paths via which 
the nodes can interact indirectly with each other is not 
analytically tractable, except in simple cases such as 
when the nodes interact along a 1-dimensional chain (or 
any acyclic graph of interactions). More interesting cases, 
such as 2-dimensional sheets of node interactions, are not 
analytically tractable in general (although there are spe- 
cial cases that are exceptions, such as the 2-dimensional 
Ising model). 

One case which can be solved analytically is the case 
of an MRF with a single node that interacts with a fixed 
external source. In effect, this is an N-node MRF in 
which N — 1 of the nodes are frozen, and their influence 
on the single remaining (unfrozen) node is represented 
by the external source. This case is interesting because 
it is the model that is used in the simplest version (i.e. 
single coding layer) of ACEnet [6]; it is therefore prudent 
to use the operator methods developed in this paper to 
verify that the MCMC equilibrium state corresponds to 
the behaviour that is observed in ACEnet. 

The state space of a multiply occupied 1-node MRF is an $n$-sample histogram. The aim now is to derive the equilibrium state of an $n$-sample histogram under the action of repeated MCMC samplings generated by $H = \left( \sum_{i=1}^{m} p_i a_i^\dagger \right) \left( \sum_{j=1}^{m} a_j \right)$ (see Equation 3.35), where the probabilities $p_i$ are derived from a fixed external source. The equilibrium mixed state $\Psi$ must satisfy the self-consistent bound state equation

$$\left( \sum_{i=1}^{m} p_i a_i^\dagger \right) \left( \sum_{j=1}^{m} a_j \right) \Psi = \lambda \Psi \qquad (4.6)$$

where $\lambda$ is an eigenvalue. In other words the MCMC update operator must map the equilibrium state into a multiple of itself, as is expected of an equilibrium state. Because correct normalisation of the state and of the MCMC update operator have not been imposed (to avoid lots of distracting normalisation factors appearing in the mathematics), the eigenvalue is not the expected $\lambda = 1$, but nevertheless the value of $\lambda$ may be readily interpreted (see after Equation 4.15).

The mixed state $\Psi$ can be expanded as a weighted mixture of pure states thus

$$\Psi = \sum_{n_1, n_2, \cdots, n_m} \psi(n_1, n_2, \cdots, n_m) \prod_{k=1}^{m} (a_k^\dagger)^{n_k} |0\rangle \qquad (4.7)$$

where $(a_k^\dagger)^{n_k} |0\rangle$ is (up to a normalising constant) a histogram with $n_k$ samples in bin $k$, $\prod_{k=1}^{m} (a_k^\dagger)^{n_k} |0\rangle$ is (up to a normalising constant) a histogram with occupancy $(n_1, n_2, \cdots, n_m)$, $\psi(n_1, n_2, \cdots, n_m)$ is the probability (up to a normalising constant) of this histogram occurring, and $\sum_{n_1, n_2, \cdots, n_m} (\cdots)$ is a mixture of such histograms. Note that it is not necessary to introduce the normalising constants explicitly because all we are trying to do is to demonstrate that $\Psi$ is a solution of Equation 4.6.


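First of all, force the total number of samples to be constrained. In physicists’ terminology, the case with a fixed number of samples is a canonical ensemble, rather than a grand canonical ensemble in which the total number of samples would be allowed to vary. Thus write $\Psi$ as

$$\Psi = \sum_{n_1, n_2, \cdots, n_m} \delta_{n, n_1 + n_2 + \cdots + n_m} \, \psi(n_1, n_2, \cdots, n_m) \prod_{k=1}^{m} (a_k^\dagger)^{n_k} |0\rangle \qquad (4.8)$$

where the Kronecker delta $\delta_{n, n_1 + n_2 + \cdots + n_m}$ ensures that only terms in $\sum_{n_1, n_2, \cdots, n_m} (\cdots)$ that satisfy the condition $n = n_1 + n_2 + \cdots + n_m$ can contribute.

Now find the state $\Psi$ that satisfies the consistency condition in Equation 4.6. First substitute Equation 4.8 into the left hand side of Equation 4.6 to obtain

$$\sum_{n_1, n_2, \cdots, n_m} \delta_{n, n_1 + n_2 + \cdots + n_m} \, \psi(n_1, n_2, \cdots, n_m) \left( \sum_{i=1}^{m} p_i a_i^\dagger \right) \left( \sum_{j=1}^{m} a_j \right) \prod_{k=1}^{m} (a_k^\dagger)^{n_k} |0\rangle \qquad (4.9)$$

Now use that $a_i (a_j^\dagger)^{n} |0\rangle = n \, \delta_{i,j} (a_j^\dagger)^{n-1} |0\rangle$ to move all of the annihilation operators to the right in the $\left( \sum_{i=1}^{m} p_i a_i^\dagger \right) \left( \sum_{j=1}^{m} a_j \right) \prod_{k=1}^{m} (a_k^\dagger)^{n_k} |0\rangle$ part of the expression in Equation 4.9 to obtain the following simplification (relabelling the dummy indices so that $j$ is the bin in which the sample is created and $i$ is the bin in which it is annihilated)

$$
\begin{aligned}
(\cdots) |0\rangle &= \sum_{i,j=1}^{m} p_j \, a_j^\dagger a_i \, (a_1^\dagger)^{n_1} (a_2^\dagger)^{n_2} \cdots (a_m^\dagger)^{n_m} |0\rangle \\
&= \sum_{j=1}^{m} p_j \left[ n_j \, (a_1^\dagger)^{n_1} \cdots (a_m^\dagger)^{n_m} + \sum_{\substack{i=1 \\ i \neq j}}^{m} n_i \, (a_1^\dagger)^{n_1} \cdots (a_i^\dagger)^{n_i - 1} \cdots (a_j^\dagger)^{n_j + 1} \cdots (a_m^\dagger)^{n_m} \right] |0\rangle
\end{aligned}
\qquad (4.10)
$$

where the cases $i = j$ (annihilation and creation within a single bin) and $i \neq j$ (annihilation in one bin and creation in another bin, i.e. hopping) have to be considered separately.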

The contribution for a given final state $j$ (but summing over the initial state $i$) can be represented diagrammatically as follows

[Diagram: the $i = j$ contribution, weighted by $p_j n_j$, plus the $i \neq j$ contributions, weighted by $p_j \sum_{i \neq j} n_i$; in each case the probability $p_j$ is supplied by the source.]

which is a sum of contributions of the form

[Diagram: a sample is annihilated in bin $i$ by $a_i$ and created in bin $j$ by $a_j^\dagger$, with weight $p_j n_i$, where $p_j$ is supplied by the source.]

where the overall factor of $n_i$ comes from the fact that the annihilation operator $a_i$ has $n_i$ samples to choose from in the initial state.


The coefficients of corresponding contributions to the left hand side and right hand side of the equilibrium condition in Equation 4.6 can now be matched up. Note that this matching of coefficients is allowed because the set of states $\prod_{k=1}^{m} (a_k^\dagger)^{n_k} |0\rangle$ is orthogonal and complete (see Section IIIB7). This leads to the following consistency equation that interrelates the $\psi(n_1, n_2, \cdots, n_m)$.

$$\sum_{j=1}^{m} p_j n_j \, \psi(n_1, n_2, \cdots, n_m) + \sum_{\substack{i,j=1 \\ i \neq j}}^{m} p_j \, (n_i + 1) \, \psi(n_1, \cdots, n_i + 1, \cdots, n_j - 1, \cdots, n_m) = \lambda \, \psi(n_1, n_2, \cdots, n_m) \qquad (4.11)$$

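Now define a trial solution to this equation (where $n = n_1 + n_2 + \cdots + n_m$)

$$\psi(n_1, n_2, \cdots, n_m) = \frac{n!}{n_1! \, n_2! \cdots n_m!} \; p_1^{n_1} p_2^{n_2} \cdots p_m^{n_m} \qquad (4.12)$$

This trial solution corresponds to placing $n$ samples at random into the histogram, using sampling probabilities $(p_1, p_2, \cdots, p_m)$ for each of the $m$ bins. The probability factor $p_1^{n_1} p_2^{n_2} \cdots p_m^{n_m}$ is the probability of each possible way of placing $n$ samples (taking account of the order in which the samples are placed), and the multinomial factor $\frac{n!}{n_1! \, n_2! \cdots n_m!}$ is the number of possible orderings of samples that leave the histogram unchanged (i.e. permute within bins but not between bins). It is reasonable to expect this to be the solution because the effect of $H$ (i.e. $\left( \sum_{i=1}^{m} p_i a_i^\dagger \right) \left( \sum_{j=1}^{m} a_j \right)$) is to randomly annihilate a sample from the histogram, and then to create it again with probability $p_i$ in bin $i$ (which is a memoryless operation), so the $\psi(n_1, n_2, \cdots, n_m)$ given in Equation 4.12 should be an equilibrium solution for updates generated by $H$.

Substitute this trial solution into the consistency equation Equation 4.11 to obtain

$$
\begin{aligned}
&\sum_{j=1}^{m} p_j n_j \, \frac{n!}{n_1! \cdots n_m!} \, p_1^{n_1} \cdots p_m^{n_m} \\
&\quad + \sum_{\substack{i,j=1 \\ i \neq j}}^{m} p_j \, (n_i + 1) \, \frac{n!}{n_1! \cdots (n_i + 1)! \cdots (n_j - 1)! \cdots n_m!} \, p_1^{n_1} \cdots p_i^{n_i + 1} \cdots p_j^{n_j - 1} \cdots p_m^{n_m} \\
&\quad = \lambda \, \frac{n!}{n_1! \cdots n_m!} \, p_1^{n_1} \cdots p_m^{n_m}
\end{aligned}
\qquad (4.13)
$$

Cancel the factorials and the probability factors.

$$\sum_{j=1}^{m} p_j n_j + \sum_{\substack{i,j=1 \\ i \neq j}}^{m} p_i n_j = \lambda \qquad (4.14)$$

Solve this equation for the eigenvalue $\lambda$, and use that $\sum_{i=1}^{m} p_i = 1$ and $\sum_{j=1}^{m} n_j = n$ to simplify the result.

$$
\begin{aligned}
\lambda &= \sum_{j=1}^{m} p_j n_j + \sum_{\substack{i,j=1 \\ i \neq j}}^{m} p_i n_j \\
&= \sum_{j=1}^{m} p_j n_j + \sum_{i,j=1}^{m} p_i n_j - \sum_{j=1}^{m} p_j n_j \\
&= \left( \sum_{i=1}^{m} p_i \right) \left( \sum_{j=1}^{m} n_j \right) \\
&= n
\end{aligned}
\qquad (4.15)
$$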


Thus $\lambda = n$ which is the (fixed) total number of samples in the histogram. The source of this factor is $H = \left( \sum_{i=1}^{m} p_i a_i^\dagger \right) \left( \sum_{j=1}^{m} a_j \right)$, where each annihilation operator $a_j$ has $n_j$ samples to choose from in the initial state, so the sum of annihilation operators $\sum_{j=1}^{m} a_j$ generates $\sum_{j=1}^{m} n_j = n$ separate contributions. The fact that $\lambda$ is a constant means that the consistency equation (i.e. Equation 4.11) has an eigenvalue $\lambda$ that is independent of the choice of $(n_1, n_2, \cdots, n_m)$, which means that the update operator $H$ has the same effect on each pure state component of the equilibrium state $\Psi$ (as is required in order for $\Psi$ to satisfy Equation 4.6).

The result in Equation 4.15 verifies that the trial solution proposed in Equation 4.12 is correct, and that the equilibrium histogram state corresponds to placing $n$ samples at random into the histogram using sampling probabilities $(p_1, p_2, \cdots, p_m)$ for each of the $m$ bins.
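The eigenvalue result can also be checked by brute force for a small histogram. The sketch below builds the multinomial trial state for every occupancy with $n$ samples in $m$ bins, applies the update operator (annihilate one sample, then create one with weight $p_i$), and compares the result with the original state. Exact rational arithmetic is used so that the comparison is not affected by rounding; the values of `n` and `p` and the function names `multinomial_state` and `mcmc_update` are illustrative assumptions.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product
from math import factorial


def multinomial_state(n, p):
    """Trial state: psi(n_1..n_m) = n!/(n_1!...n_m!) * p_1^n_1 ... p_m^n_m."""
    state = {}
    for hist in product(range(n + 1), repeat=len(p)):
        if sum(hist) != n:
            continue
        weight = Fraction(factorial(n))
        for n_k, p_k in zip(hist, p):
            weight = weight / factorial(n_k) * p_k ** n_k
        state[hist] = weight
    return state


def mcmc_update(state, p):
    """Apply (sum_i p_i a_i^dagger)(sum_j a_j) and regroup the histograms."""
    out = defaultdict(Fraction)
    for hist, weight in state.items():
        for j, n_j in enumerate(hist):
            if n_j == 0:
                continue
            for i, p_i in enumerate(p):
                new = list(hist)
                new[j] -= 1
                new[i] += 1
                out[tuple(new)] += weight * n_j * p_i
    return dict(out)


n = 3
p = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))
psi = multinomial_state(n, p)
updated = mcmc_update(psi, p)

# Every component should be multiplied by the same eigenvalue.
ratios = {updated[hist] / psi[hist] for hist in psi}
print(ratios)
```

If the trial state is an eigenstate, the set `ratios` contains a single value, which can be compared with the total number of samples $n$.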


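Summarise these results:


1. Basic MCMC update operator: $H = \left( \sum_{i=1}^{m} p_i a_i^\dagger \right) \left( \sum_{j=1}^{m} a_j \right)$

2. General state (fixed $n$): $\Psi = \sum_{n_1, n_2, \cdots, n_m} \delta_{n, n_1 + n_2 + \cdots + n_m} \, \psi(n_1, n_2, \cdots, n_m) \prod_{k=1}^{m} (a_k^\dagger)^{n_k} |0\rangle$

3. Equilibrium condition: $\left( \sum_{i=1}^{m} p_i a_i^\dagger \right) \left( \sum_{j=1}^{m} a_j \right) \Psi = \lambda \Psi$

4. Equilibrium state: $\psi(n_1, n_2, \cdots, n_m) = \frac{n!}{n_1! \, n_2! \cdots n_m!} \; p_1^{n_1} p_2^{n_2} \cdots p_m^{n_m}$ with $\lambda = n$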


The equilibrium state is a mixture of pure states, where each pure state is weighted by the probability of its occurrence. In this approach the state $\Psi$ of the system
corresponds to the entire probability-weighted ensemble 
of alternative histograms. In effect, these histograms mix 
with each other under the updating action of the fixed 
external source that causes the samples in the bins of 
each histogram to hop from bin to bin, whilst conserving 
the total number of samples in the histogram (i.e. there 
is migration of samples but no birth or death of sam- 
ples). The equilibrium condition ensures that the mixing 
that occurs due to the hopping of samples has no net ef- 
fect on the probability-weighted ensemble of alternative 
histograms. 

This completes the demonstration that the simplest 
(i.e. a single node) multiple occupancy MRF has the
same properties as ACEnet [6], which is defined as hav- 
ing an equilibrium state that is generated by the random 
(but probability-weighted) placement of n samples into a 
set of histogram bins. Also, larger SONs can be built out 
of multiple linked ACEnet modules, and these correspond 
to MRFs with a larger number of nodes. This unification 
of MRFs and SONs is possible because both approaches 
can be viewed as implementing algorithms for manipu- 
lating samples in histogram bins, and all such algorithms 
can be expressed by using the algebra of creation and an- 
nihilation operators. A key advantage of this MRF/SON 
unification is that the techniques that are used to train 
SONs (i.e. to discover structure in data) can now be 
used to train MRFs, which allows the MRF graph structure (i.e. nodes and connections) to adapt itself so that
it is better matched to the data it is trying to model. 

The MCMC updating of MRFs whose nodes are oc- 
cupied by multiple samples potentially leads to lots of 
interesting properties. The derivation above shows how 
a single node MRF behaves under the influence of a fixed 
external source, but more interesting behaviour occurs 
when either the MRF has a single node but the external 
source is variable, or if the MRF has multiple interact- 
ing nodes so that each node sees the variable state of 
the other nodes. This last case is especially interesting 
in MRFs that are trained as SONs, because it leads to 


[1] J Besag, Spatial interaction and the statistical analysis of 
lattice systems, Journal of the Royal Statistical Society: 
Series B 36 (1974), no. 2, 192-236. 
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behaviours in which the samples that occupy the nodes 
act collectively, and thus cause the joint node states to 
behave like extended symbols (see Section IID for some 
diagrams that illustrate this point in more detail). 

V. CONCLUSIONS 


The work described in this paper assumes that Markov 
random field models are used to implement Bayesian in- 
ference. The key contribution of this paper is an im- 
plementation using creation and annihilation operators 
of MCMC algorithms for simulating MRFs. This the- 
oretical framework has a similar structure to that used 
in quantum field theories of bosons in physics [9]. An 
equilibrium solution of the MCMC update operator is 
derived which is shown to be equivalent to the equilib- 
rium behaviour of the adaptive cluster expansion network 
(ACEnet) [6], which is a type of self-organising network 
that computes using discrete-valued quantities. 

This point of contact between MRF theory and SON 
behaviour allows the theories of these two fields to be uni- 
fied. Although MRFs and SONs are superficially different 
(MRFs have one sample per node, whereas ACEnet SONs
have multiple samples per node), the underlying oper- 
ators that are used to manipulate them are the same. 
MRF theory could benefit from this unification by be- 
ing able to make use of SONs to build MRF networks 
in a data-driven way. SON theory could benefit from 
this unification by being able to make full use of the rich 
theoretical framework of MRFs.

It is very convenient that MRFs and SONs are uni- 
fied within a QFT framework, because such theories are 
used extensively by physicists to describe the interac- 
tion of particles, and many techniques have been devel- 
oped to compute results using such theories. We have 
found that it is very easy to transfer knowledge from 
QFT to the unified MRF/SON framework presented in 
this paper. Also, the diagrammatic notation (i.e. Feyn- 
man diagrams) makes it much easier to understand what 
MCMC algorithms are actually doing, without becoming 
submerged in large amounts of theory. 

Future papers in this “discrete network dynamics” series
will focus in detail on the consequences of
implementing MCMC algorithms using update operators 
built out of creation and annihilation operators. 
Guided Tour of Publications *


Stephen Luttrell 


This is an informal walk through the research context of all of the papers that I wrote between
1981 and 2007. Nearly all of the papers are single-authored and interconnected so this document is 
self-contained. 


1981-1982: PhD 


Papers Luttrell and Wada [1], Luttrell et al. [2], Luttrell and Wada [3] (written in collaboration with others) and 
PhD dissertation Luttrell [4] apply quantum chromodynamics to modelling the quark/gluon structure of hadrons seen 
in deep inelastic scattering experiments. Various higher order effects that modify lower order perturbative results 
are studied both phenomenologically and theoretically. The operator algebra techniques used in this PhD research 
unexpectedly turned out to be very useful for the description of Markov chain Monte Carlo methods, as described in 
my publications from 2005 onwards. 


1984-1989: Bayesian Super-Resolution and Mutual Information (scattered field) 


Report Luttrell and Oliver [5] describes how linear least squares error reconstruction using an appropriately weighted 
reconstruction space can be used to introduce prior knowledge to enhance the resolution of coherent images. The 
material in this report was not published because it was quickly overtaken by the full Bayesian treatment described 
below. 


Paper Luttrell [6] describes how Bayes’ theorem may be used to derive the solution to linear inverse problems under the 
assumption of Gaussian probability density functions (PDF). For suitable Bayesian priors this leads to super-resolution 
where details on a scale shorter than the Rayleigh resolution length become visible. 
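
As a hedged illustration of the Gaussian case, the following Python sketch solves the simplest possible linear inverse problem: a single unknown x observed as y = a*x + noise, with a zero-mean Gaussian prior on x and zero-mean Gaussian noise. The function name and parameters are assumptions introduced here; the imaging problems in the paper involve a full linear operator rather than a scalar, so this only shows the structure of the Bayesian solution.

```python
def gaussian_posterior(y, a, prior_var, noise_var):
    """Posterior mean and variance of x given y = a*x + noise.

    Assumes x ~ N(0, prior_var) and noise ~ N(0, noise_var).
    """
    denom = a * a * prior_var + noise_var
    gain = a * prior_var / denom           # weight given to the data
    mean = gain * y                        # posterior mean of x
    var = prior_var * noise_var / denom    # posterior variance of x
    return mean, var


print(gaussian_posterior(y=2.0, a=1.0, prior_var=1.0, noise_var=1.0))
```

When the noise variance is small the posterior mean approaches y/a (the data dominate), and when it is large the mean is pulled towards the prior mean of zero, which is how the prior controls what can be recovered.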


Book chapter Luttrell and Oliver [7] and report Luttrell [8] present an introduction to the use of prior knowledge in 
the analysis of SAR images. This was patented in Luttrell and Oliver [9]. 


Papers Luttrell [10, 11] introduce the principle of mutual information maximisation as a way of optimising the 
extraction of information from data. 
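
For readers who want the quantity being maximised made concrete, here is a small Python sketch that computes the mutual information of a discrete joint distribution. The table layout joint[x][y] and the function name are assumptions for illustration; the papers deal with how to maximise this quantity, which the sketch does not attempt.

```python
import math


def mutual_information(joint):
    """Mutual information I(X;Y) in bits for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal over y
    py = [sum(col) for col in zip(*joint)]      # marginal over x
    total = 0.0
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0.0:                         # 0 * log(0) is taken as 0
                total += p * math.log2(p / (px[x] * py[y]))
    return total


noiseless = [[0.5, 0.0], [0.0, 0.5]]      # output determines input
useless = [[0.25, 0.25], [0.25, 0.25]]    # output independent of input
print(mutual_information(noiseless), mutual_information(useless))
```

The two example tables bracket the possibilities for a binary channel: one bit is transferred when the output determines the input, and none when they are independent.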


Paper Luttrell and Oliver [12] is on clutter and targets in SAR images. My contribution showed how to use a 
generalisation of the super-resolution techniques originally developed in Luttrell [6] to analyse images of targets, and 
showed how to use mutual information (as in Luttrell [10, 11]) to interpret the information processing in terms of 
information channels. A brief description of this work is in Luttrell and Oliver [13]. 


Report Luttrell [14] describes an analysis of point targets in images produced by the Royal Signals and
Radar Establishment (RSRE) SAR. The purpose of this was to accurately calibrate the point spread function (PSF) 
of the RSRE SAR so that its images could be robustly super-resolved. However, this analysis led me to the conclusion 
that for the RSRE SAR the PSF was itself a function of the data; in other words the system response was non-linear. 
This totally invalidated the super-resolution theory developed thus far, which assumed that the system response was
linear, and it triggered a shift of my research away from super-resolution. A side effect of this non-linearity was
that the higher moments of the image data were biased away from their naive values, and thus should not have been 
used in their uncorrected form to deduce the statistics of the underlying clutter model. 


Report Luttrell [15] discusses the intimate connection between super-resolution and the analysis of phase information 
in SAR data. 


*Typeset in LaTeX on May 6, 2019.
This guided tour was originally written for my curriculum vitae. 
Paper Delves et al. [16] and conference papers Pryde et al. [17, 18] describe the results of collaborative work on a
parallel processing implementation of these super-resolution techniques for analysing synthetic aperture radar (SAR) 
images. 


Book chapter Luttrell [19] (delayed for several years in publication) reviews the ideas behind the use of information 
channels, drawing together ideas from mutual information, super-resolution, and Bayesian inference. 


1985-1988: Markov Random Fields for Clutter and Texture Modelling 


Report Luttrell [20] is the first in a long series of publications on the use of Markov random fields (MRF) to build 
probability density function (PDF) models, and describes how to train MRF models using a generalisation of
the Boltzmann machine learning algorithm that I called the Gibbs Machine. Various generalisations were discussed 
such as the idea of enhancing an MRF model by using these learning techniques to attach a “brain graft” to the MRF 
model to patch it up wherever it did not give a good enough approximation to the data. 


Paper Luttrell [21] describes how to do Monte Carlo sampling of arbitrarily complicated MRFs by bit-flipping opera- 
tions that could be easily implemented in hardware. 
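
To show what sampling an MRF by bit flips looks like in its most basic form, here is a Python sketch of a standard single-bit-flip Metropolis sampler on a ring of binary nodes with nearest-neighbour coupling. This is a textbook scheme chosen for illustration and is not claimed to be the hardware-oriented method of the paper; the ring topology, the energy function and all names are assumptions.

```python
import math
import random


def energy_change(state, i, coupling):
    """Energy change caused by flipping bit i.

    Bits 0/1 are mapped to spins -1/+1, and the assumed energy is
    E = -coupling * (sum of products of neighbouring spins on a ring),
    with the temperature absorbed into coupling.
    """
    n = len(state)
    s = 2 * state[i] - 1
    left = 2 * state[(i - 1) % n] - 1
    right = 2 * state[(i + 1) % n] - 1
    return 2.0 * coupling * s * (left + right)


def sample_mrf(n_nodes, coupling, n_sweeps, rng):
    """Run Metropolis bit-flip updates and return the final bit pattern."""
    state = [rng.randint(0, 1) for _ in range(n_nodes)]
    for _ in range(n_sweeps * n_nodes):
        i = rng.randrange(n_nodes)
        delta = energy_change(state, i, coupling)
        # Accept downhill moves always, uphill moves with prob exp(-delta).
        if delta <= 0.0 or rng.random() < math.exp(-delta):
            state[i] ^= 1
    return state


rng = random.Random(0)
print(sample_mrf(n_nodes=16, coupling=1.0, n_sweeps=50, rng=rng))
```

Each update touches one bit and needs only the states of its neighbours, which is the locality that makes this kind of sampler a natural candidate for simple hardware.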


Report Luttrell [22] uses elementary PDF methods to measure radar sensitivity in a way that is invariant w.r.t. the 
receiver law. 


Paper Luttrell [23] presents a careful analysis of the use of MRF models for texture modelling, using arguments 
based on sufficient statistics to understand how the texture information is spread out amongst the various measured 
statistics. This is conceptually similar to the use of information channels discussed in Luttrell and Oliver [12]. Contact 
with image processing techniques is made via the grey level co-occurrence matrix method and the WISARD n-tuple 
processing network, both of which are special cases of this MRF analysis. 


Conference papers Luttrell [24, 25, 26, 27] develop various aspects of the use of MRFs for texture and clutter modelling.


Paper Luttrell [28] shows how the maximum entropy method can be used to pick good statistics to measure in textured 
images (as introduced in Luttrell [23]), so that an efficient MRF texture model can be built. An efficient algorithm 
(which does not require the fixing of Lagrange multipliers, unlike other approaches) is given for deciding what new 
statistics to add to the MRF model by comparing synthetic textures generated using the current MRF model with 
the real texture to be modelled. As a corollary, a generalisation of the original Boltzmann machine learning algorithm 
is given for arbitrary MRFs. 


Conference paper Luttrell [29] compares and contrasts the MRF approach (as in Luttrell [28]) and a proposed hierarchical
“cluster-decomposition” approach to modelling PDFs. The generalised form of the Boltzmann machine learning
algorithm is presented in detail. The advantages of the cluster decomposition approach (particularly that it does not 
use Monte Carlo simulations) are explained. 


Report Luttrell [30] analyses the optimisation of hidden Markov models (HMM) using Gibbs Machine methods. 


Conference paper Luttrell [31] summarises the MRF approach in the context of modelling texture in radar images. 


1989-1991: Bayesian Super-Resolution (scatterers themselves) 


Paper Luttrell [32] extends the earlier Bayesian super-resolution method (see Luttrell [6]) by pointing out the fact 
that the inverse problem to solve should be the reconstruction of the object scatterers that produce the object field, 
and not the reconstruction of the object field itself. In report Luttrell [33] and paper Luttrell [34] this work is greatly 
extended by using the expectation-maximisation (EM) method to derive an iterative object reconstruction algorithm 
from first principles. This allows both super-resolution and auto-focusing to be handled within the same framework. 
This work was reviewed in Luttrell [35]. Report Luttrell [36] shows how the iterated EM method can be applied to 
reconstructing the object scatterers under the assumption that they produce a K-distributed object field. 
1988-1989: Hierarchical Self-Organising Maps


Conference paper Luttrell [37] describes a hierarchical generalisation of the standard Kohonen self-organising map 
(SOM) network, and introduces the idea of the “growing grid” method of training SOMs. Conference paper Luttrell 
[38] and paper Luttrell [39] train a tree of linked SOMs to encode SAR images. In paper Luttrell [40] this approach
is described in more detail. 


1986-1990: Digital Receiver 


Report Luttrell and Pritchard [41] and paper Luttrell and Pritchard [42] describe the analysis, design and testing of 
a finite state machine for demodulating signals. My contribution was the analysis. This is essentially an early type 
of software radio receiver. This was patented in Luttrell and Pritchard [43]. 


1989-1992: Distortion Minimising Encoders (single SOM) 


Conference papers Luttrell [44, 45] and paper Luttrell [46] present a novel derivation of SOM networks, based on a 
generalisation of the standard Linde-Buzo-Gray (LBG) vector quantisation algorithm to account for distortion on the 
communication channel, which approximates the standard Kohonen SOM algorithm as a special case (the distribution 
of distortions corresponds to the topographic neighbourhood function). The biggest advantage of the approach used is 
that it is based on minimisation of an objective function (unlike Kohonen’s SOM), which thus allows many properties 
to be theoretically derived. The density of code vectors in this type of SOM (for 1-dimensional data) is derived in 
report Luttrell [47] and paper Luttrell [48], where the density is shown to be the same as in a standard vector quantiser 
(unlike the standard Kohonen SOM which leads to a different density which inconveniently depends on the choice of 
topographic neighbourhood function). A generalisation of this result to data of arbitrary dimensionality was given in 
report Luttrell [49]. Report Luttrell [50] shows how to use a vector quantiser to encode data under the assumption 
that its noise is K-distributed. 
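
The idea of an LBG-style vector quantiser that accounts for distortion on the communication channel can be sketched in Python for 1-dimensional data. In this sketch the channel turns a transmitted code index j into a received index k with probability pi[j][k]; the Gaussian form of pi, its width parameter, the batch update schedule and all function names are assumptions made for illustration, not details taken from the papers.

```python
import math
import random


def neighbourhood(n_codes, width):
    """pi[j][k]: assumed probability that index j is received as index k."""
    pi = []
    for j in range(n_codes):
        row = [math.exp(-0.5 * ((j - k) / width) ** 2) for k in range(n_codes)]
        norm = sum(row)
        pi.append([v / norm for v in row])
    return pi


def encode(x, codes, pi):
    """Choose the index whose expected post-channel distortion is smallest."""
    costs = [sum(p * (x - c) ** 2 for p, c in zip(row, codes)) for row in pi]
    return min(range(len(codes)), key=costs.__getitem__)


def train(data, n_codes, width, n_iters, rng):
    """Alternate encoding and code-vector updates, as in batch LBG."""
    codes = sorted(rng.choice(data) for _ in range(n_codes))
    pi = neighbourhood(n_codes, width)
    for _ in range(n_iters):
        num = [0.0] * n_codes
        den = [0.0] * n_codes
        for x in data:
            j = encode(x, codes, pi)
            for k in range(n_codes):
                # Every code that index j may be distorted into shares x.
                num[k] += pi[j][k] * x
                den[k] += pi[j][k]
        codes = [num[k] / den[k] if den[k] > 0.0 else codes[k]
                 for k in range(n_codes)]
    return codes


rng = random.Random(0)
data = [rng.random() for _ in range(200)]
codes = train(data, n_codes=5, width=1.0, n_iters=10, rng=rng)
print(codes)
```

The code-vector update spreads each data point over the codes that its index could be distorted into, so pi plays the role of the topographic neighbourhood function. Replacing the minimum-expected-distortion rule in encode with a plain nearest-code rule gives the approximation that corresponds to the standard Kohonen update.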


1988-1994: Lateral Mutual Information Maximising Hierarchical Encoders 


Reports Luttrell [51, 52] and arXiv papers Luttrell [53, 54] show how optimisation of the hierarchical “cluster
decomposition” introduced in conference paper Luttrell [29] can be used to detect anomalies in textured images. 
Conference paper Luttrell [55] presents a detailed analysis of this optimisation process, where the objective function 
used is the relative entropy between the model PDF and the true PDF, which is equivalent to the sum of the (lateral) 
mutual informations between various nodes in each layer of the network. This approach has been patented in Luttrell
[56]. 


1990-1992: Distortion Minimising Encoders (coarse-grain parallel SOMs) 


Report Luttrell [57] and conference paper Luttrell [58] use a communication channel model of SOMs (see Luttrell 
[46]) to analyse networks of connected SOMs, and this led to paper Luttrell [59]. The model used is one in which 
two communication channels mutually interfere thus causing distortion that is correlated between the channels. This 
distortion defines the topographic neighbourhood functions that are used in the SOMs (that are the channel encoders) 
in the first place. The optimisation of the SOMs is influenced by mutual interactions between their outputs, hence 
the description of this type of training as “self-supervised”. 
1988-1994: Some Consolidation


Report Luttrell [60] and conference paper Luttrell [61] analyse adaptive n-tuple networks (e.g. WISARD) by using a 
relative entropy objective function to optimise a PDF model. Both supervised and unsupervised methods of training 
such networks emerge naturally from this analysis. 


Report Luttrell [62] and conference paper Luttrell [63] review the Bayesian approach to training and using neural 
networks. Various models are discussed including the “adaptive cluster expansion” model originally introduced in 
conference paper Luttrell [29]. 


Conference paper Luttrell [64] presents a rigorous Bayesian analysis of the “adaptive cluster expansion” model. 


1992-1994: Partitioned Mixture Distributions 


Report Luttrell [65] and conference papers [66, 67] describe a generalisation (the partitioned mixture distribution, or
PMD) of the standard mixture distribution used to model PDFs, and this led to paper Luttrell [68]. In image processing 
(the statistical properties of) each correlation area of an image can be modelled with a mixture distribution, and it 
would be convenient if all of these models could be collected together in a single translation invariant architecture. 
A nice solution to this problem is to build each mixture distribution using components selected from a common pool 
of mixture components, where each correlation area has a complete repertoire of components needed for its analysis. 
Mixture distributions that see overlapping areas of image have a lot of overlap in the components they use. A PMD 
can be trained using a generalisation of the EM method for training standard mixture distributions. Trained PMDs 
have many of the properties that are observed in the low-level visual cortex, such as a complete repertoire of processing 
machinery for each local patch of image. 


1990-1994: Folded Markov Chains 


Report Luttrell [69] introduces the folded Markov chain (FMC) technique, and paper Luttrell [70] uses it to give a 
Bayesian analysis of the SOM model originally described in Luttrell [46]. In an FMC information is passed along 
a Markov chain, and then Bayes’ theorem is used to pass (virtually) the information back along the chain in the 
opposite direction until it reaches the start again. This final reconstruction of the input to the chain is compared with 
the original input, and the transitions in the Markov chain are adjusted to minimise the Euclidean reconstruction 
distortion (on average). The tight constraints that Bayes’ theorem imposes on the various probabilities make it possible 
to derive a training algorithm that reduces to the algorithm in Luttrell [46] when the encoder is deterministic. This 
FMC analysis unifies all my earlier work on networks of SOMs, and by allowing probabilistic encoders to be handled 
in the same framework, it sets the scene for future work. 


1993-1995: Some Consolidation 


Reports Luttrell [71, 72, 73] describe various ways in which mixture distributions can be used in the analysis of images. 


Conference paper Luttrell [74] and report Luttrell [75] show the application of SOM techniques to the analysis of
data (e.g. range profiles of radar returns from ships) that lies on a manifold with a circular topology. Prior knowledge 
of the topology is used to define an appropriate topographic neighbourhood function for use in the SOM. 
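
One simple way of building prior knowledge of a circular topology into a topographic neighbourhood function is to measure the distance between code indices around a ring rather than along a line. The Python sketch below does this with a Gaussian profile; the profile, the width parameter and the function name are assumptions for illustration and are not taken from the paper or report.

```python
import math


def circular_neighbourhood(n_codes, width):
    """Row-normalised neighbourhood function on a ring of n_codes indices."""
    rows = []
    for j in range(n_codes):
        row = []
        for k in range(n_codes):
            d = abs(j - k)
            d = min(d, n_codes - d)    # shortest way round the ring
            row.append(math.exp(-0.5 * (d / width) ** 2))
        norm = sum(row)
        rows.append([v / norm for v in row])
    return rows


h = circular_neighbourhood(8, 1.0)
print(h[0])  # indices 1 and 7 are equally close to index 0
```

With this choice the first and last code indices are neighbours, so a SOM trained with it has no end effects and can wrap smoothly around a closed manifold.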


1994-1997: Unification of FMC and PMD 


I celebrate “FMCPMD Day” on 17th June each year (year zero was in 1994) in recognition of the day that I finally
realised that the correct objective function for optimising a PMD should focus on minimising the reconstruction 
distortion rather than optimising the PDF. More generally, distortion-minimisation rather than PDF-optimisation 
is the correct (and much simpler) way forward, and the reason it took me so long to realise this was that the
mathematics of PDF-optimisation is so much prettier than the mathematics of distortion-minimisation. 


Book chapter Luttrell [76] (delayed for several years in publication), reports Luttrell [77, 78, 79] and arXiv paper 
Luttrell [80] combine the FMC approach (introduced in Luttrell [70]) with the PMD approach (introduced in Luttrell 
[68]) to create an encoder structure that is appropriate for the analysis of images, where the form of the PMD posterior 
probability allows the encoder to break into local pieces that each encode a local patch of the image. This is then 
applied to the analysis of images derived from multiple sources. In the case of pairs of images the network learns 
a structure that closely resembles the dominance stripes and orientation maps observed in the visual cortex. The 
important point is that you can obtain these “biological” results as a consequence of the general properties of encoders 
to emphasise this the acronym VICON (VIsual COrtex Network) was coined in Luttrell [79, 80].


Report Luttrell [81] and conference papers [82, 83] present a detailed analysis of some of the properties of PMDs 
(introduced in Luttrell [68]). A dynamical PMD (which has the same relationship to a static PMD that a hidden 
Markov model has to a mixture distribution) is analysed in detail, and is successfully applied to the problem of 
tracking a weak target in clutter. A first order perturbation analysis reveals the low-order properties of PMDs and 
allows their behaviour to be related to various neural networks. 


arXiv paper Luttrell [84], report Luttrell [85], and conference paper Luttrell [86] apply the FMC approach (introduced 
in Luttrell [70]) to optimising encoders that output several statistically independent (given the input) codes. Various 
analytic optimum solutions are obtained. However, these solutions are less useful than those that are obtained if the 
optimisation is constrained in various ways, which is the focus of all subsequent papers in this area. 


1997: “State of the Nation” 


Report Luttrell and Webber [87] summarises all publications produced during the period 1994-1997. 


1997: Isaac Newton Neural Networks Programme 


Report Luttrell [88] and arXiv paper Luttrell [89] describe the work I did at the Neural Networks research programme
at the Isaac Newton Institute. I analysed the optimisation of an objective function depending on the joint PDF of
the state of a multi-layer network, and showed that this unified all of my work on the optimisation of hierarchical
encoders. 


arXiv paper [90] summarises various properties of this method of optimising a network. 


1994-1997: Distortion Minimising Encoders (fine-grain parallel) 


Book chapters Luttrell [91, 92] and paper Luttrell [93] give the generalisation of the FMC approach of Luttrell [70] 
to the case where multiple codes are output by a probabilistic encoder for each of its input vectors. Up to this point 
in my work the encoders have each output a single code, and thus each acts as a winner-take-all network. The 
advantage of allowing the possibility of multiple codes is that the encoded information can be split across more than 
one channel. This is conceptually similar to the use of information channels discussed in Luttrell and Oliver [12]. 
Further, if these codes are chosen stochastically, then by optimising the FMC (minimum Euclidean distortion) the 
network can decide for itself how many information channels it needs to use. This freedom allows the network to 
optimise its own architecture. In Luttrell [93] a simple application of these ideas is analysed in detail, where various 
(supposedly) optimal choices of information channel are compared with each other, and the conditions under which 
each type of solution is optimal are established. An important cross-disciplinary result is that the above extension 
to multiple codes allows contact to be made with neural network models based on discrete firing events, where the 
recent history of which neurons have fired corresponds to the set of codes that are currently active. 


Paper Luttrell [94] applies the multiple sampling ideas of Luttrell [93] to the hierarchical vector quantisation network 
Luttrell [40]. Although the hierarchical network of Luttrell [40] has its structure manually defined, rather than learnt 
via optimisation (as could be done using Luttrell [93]), the earlier ideas in [40] automatically fit into the framework 
of the later theory in Luttrell [93]. 
Change in Publication Style


From here on there are currently no further journal papers, because delays in publication caused by the refereeing 
process had become excessive. However, the research continues to be published in conference papers and technical 
reports. 


1998: Some Consolidation 


Report Luttrell [95] describes various ways of training self-organising encoder networks by modelling their output as 
the posterior probability over samples (or “firing events”) drawn from a stochastic code book. The different training 
schemes correspond to different conditions under which the samples are drawn. 


1998-1999: Distortion Minimising Encoders (analytic optimisation) 


Conference paper Luttrell [96] shows how analytic optimisation of a stochastic encoder (an FMC with multiple output 
codes) leads to optimal encoders being described by piecewise linear posterior probabilities. This emerges because the 
reconstruction is a linear superposition of contributions, and the objective function is a Euclidean distortion. This 
piecewise property also holds in networks of linked stochastic encoders. This piecewise linearity will prove to be very 
useful for deriving analytic solutions to various problems. 


arXiv paper Luttrell [97], book chapter Luttrell [98], and report Luttrell [99] use the piecewise linear property of opti- 
mal stochastic encoders (see Luttrell [96]) to derive optimal encoding schemes for circular and toroidal manifolds. The 
results derived in this paper were obtained analytically by using Mathematica to manipulate the various cumbersome 
piecewise linear expressions leading to compact results. The circle and the torus are simple curved manifolds that 
serve as models for the more complicated curved manifolds that arise in signal and image processing. For the toroidal 
manifold two types of optimal encoder emerge depending on the size of the encoder and the number of stochastic 
codes that it is allowed to output. On the one hand a joint encoder is optimal if there are more than a certain 
minimum number of codes to choose from; this corresponds to an encoder that has so many resources that it can 
simply partition the torus into localised code cells. On the other hand a factorial encoder emerges if the number of 
codes to choose from is severely reduced and the number of codes it is allowed to output is sufficiently large; this 
corresponds to an encoder that is starved of resources but which is allowed to have several trials at outputting a 
code, and which therefore chooses to use anisotropic code cells to slice the torus into localised code cells defined by 
the regions of intersection of pairs of anisotropic code cells. The automatic emergence of factorial coding is a crucial 
property in a network that aims to discover for itself what information channels to use when processing data. 


1999: Some Consolidation 


Reports [100, 101] motivate the use of self-organising networks for discovering and extracting information in compli- 
cated data, with an emphasis on their potential application to data fusion. Report Luttrell [102] is a user’s guide to 
stochastic encoders, and contains many simple examples of their use. arXiv paper Luttrell [103] is a comprehensive 
summary of the theory and many numerical simulations of stochastic encoders, and arXiv paper Luttrell [104] contains 
a small subset of these results. 


2000: “State of the Nation 2” 


Report [105] summarises all publications produced during the period 1997-2000. 
2000-2003: Distortion Minimising Encoders (jammer suppression)


Conference paper Luttrell [106] applies a stochastic encoder to the problem of separating a signal from a jammer. 
Because the jammer is much stronger than the signal the optimisation concentrates on encoding the jammer alone. 
This allows the trained encoder to separate the jammer and signal subspaces (in this case these are curved subspaces), 
which allows the signal to be cleanly detected. This material is also covered in report Luttrell [107], conference papers 
Luttrell [108, 109], arXiv papers (extended versions of the preceding conference papers) Luttrell [110, 111], and book 
chapters [112, 113]. 


2001: Distortion Minimising Encoders (classifier fusion) 


Conference paper Luttrell [114] and workshop paper [115] demonstrate some of the key properties of stochastic 
encoders. One of these is their ability to learn factorial codes in which a code book breaks into separate smaller code 
books, each of which encodes only one part of the input. Another is the natural way in which supervision can be 
introduced to steer the code books towards coding their inputs in particular ways. This is all presented in the context 
of multiple classifier fusion, where the processing is split across multiple information channels and then fused together 
again at the end. 


2001-2004: Internal Reports (part 1) 


Report Luttrell [116] describes how to use a stochastic encoder to suppress sea clutter in coherent images, which allows 
targets that are masked by the clutter to be revealed. The approach used is essentially the same as for separating a 
signal from a jammer (as in conference paper Luttrell [106]). 


Report Luttrell [117] compares and contrasts the use of encoders and the use of PDFs for modelling (and suppressing) 
sea clutter in images. 


Reports [118, 119] summarise all of my research results on self-organising networks. 


Reports Luttrell [120, 121] describe various interesting and useful properties of the self-organising network known as 
ACEnet (Adaptive Cluster Expansion network). 


Reports Luttrell [122, 123, 124] describe how to apply self-organising networks to the problem of combining the 
outputs of multiple classifiers. 


2003: Isaac Newton Neural Networks Programme (published, at last!) 


Conference paper Luttrell [125] introduces a new way of understanding networks of linked stochastic encoders. These 
ideas were originally described in report Luttrell [88]. Rather than modelling each encoder as an FMC (see Luttrell 
[70] and its generalisation to multiple codes in Luttrell [93]) and building up the overall network objective function 
as a sum of local contributions, the whole network of linked encoders is deemed to have a single objective function 
which compares the joint PDF of the whole network under two different modelling assumptions. For a chain-like 
network the two PDF models are built from conditional probabilities acting in opposite directions along the chain, 
and it is similar to Dayan and Hinton’s Helmholtz machine except that the aim here is to optimise the joint network 
probability rather than only the marginal input probability. This approach is then shown to reduce to the FMC 
approach wherever that had been used earlier. An application of these ideas to training a network on hierarchically 
correlated data is presented, and the encoders (the bottom-up PDF models) automatically split into smaller encoders 
in precisely the way that is expected (i.e. process the most uncorrelated pieces separately, then progressively fuse 
the results until only the most correlated piece is left). This approach is thus capable of making structural changes 
to the network as it is trained, where it creates information channels for doing the lower-level processing, and then 
progressively fuses these channels to do the higher-level processing. 
2003-2004: Distortion Minimising Encoders (multiple correlation scales)


arXiv paper Luttrell [126] describes the use of self-organising stochastic encoders for learning the structure of data 
manifolds. The main aim of this paper was to demonstrate the automatic separation of correlated components in 
the data, and their subsequent fusion whilst preserving only their dominant degree(s) of freedom, as was described in 
Luttrell [125]. This is a key paper that opens up a very fruitful area of research. 


2005: Discrete Network Dynamics 


arXiv paper [127] describes the use of operator algebra techniques to reformulate MCMC algorithms, so that the 
algorithms can be manipulated by purely algebraic methods, which would have applications in adaptive networks. 
This is a key paper that opens up a very fruitful area of research. 
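
To give a flavour of what manipulating samples with an operator algebra means, the following Python sketch represents a state as a weighted sum over occupation-number tuples and applies creation and annihilation operators to it. The normalisation used here (annihilation multiplies by the occupancy, creation multiplies by one) is one common convention for stochastic systems and is an assumption; the paper may use a different one, and all function names are hypothetical.

```python
def annihilate(state, i):
    """Remove one sample from bin i, weighting by the occupancy of bin i."""
    out = {}
    for occ, amp in state.items():
        if occ[i] > 0:
            new = occ[:i] + (occ[i] - 1,) + occ[i + 1:]
            out[new] = out.get(new, 0) + amp * occ[i]
    return out


def create(state, i):
    """Add one sample to bin i."""
    out = {}
    for occ, amp in state.items():
        new = occ[:i] + (occ[i] + 1,) + occ[i + 1:]
        out[new] = out.get(new, 0) + amp
    return out


def hop(state, src, dst):
    """Move one sample from bin src to bin dst (annihilate, then create)."""
    return create(annihilate(state, src), dst)


start = {(3, 0): 1}          # three samples in bin 0, none in bin 1
print(hop(start, 0, 1))      # {(2, 1): 3}
```

The weight 3 counts the number of ways of picking which of the three samples moves. In this convention, annihilating after creating in the same bin differs from the reverse order by exactly the original state, which is the commutation rule that allows such updates to be manipulated purely algebraically.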


2005-2007: Internal Reports (part 2) 


Report Luttrell [128] comprehensively reviews MRFs and SONs from the perspective of adaptive filters and recur- 
rent networks, and introduces an operator algebra to describe the associated Markov chain Monte Carlo (MCMC) 
algorithms. 


Reports Luttrell [129, 130] apply the operator algebra methods of [127, 128] to recurrent networks, dealing with both 
the small-occupancy (i.e. dominated by quantisation effects) and large-occupancy (i.e. quantisation effects negligible) 
cases. 


Report Luttrell [131] describes a proposed application of recurrent SONs to the problem of language identification. 


Report Luttrell [132] introduces symbolic algebra techniques for manipulating the operator algebra of MCMC algo- 
rithms. 


Report Luttrell [133] presents a Mathematica implementation of recurrent SONs. 


The appendix of report [134] presents some ideas on how to use operator algebra methods and self-organising networks 
to learn the structure of networks. 